UrbanLoc: Language-Guided User Localization via Location-Aware Multimodal Retrieval

Anonymous Authors
Overview of UrbanLoc: Given a natural language description of a user's surroundings, UrbanLoc predicts the user's geographic location by retrieving over geotagged street-view images.

Overview of UrbanLoc. Given a natural language description from a user (e.g., "I am standing under the traffic light at the intersection lined with tall brown buildings, right in front of the white shop with the 'LOS TACOS' sign"), UrbanLoc predicts the user's geographic location and dispatches a robot to the estimated destination.

Abstract

Autonomous personal mobility and on-demand robotic delivery services require estimating user locations in diverse environments in order to dispatch robots to appropriate pickup or drop-off points. Although GNSS-based user localization is widely used for this purpose, it becomes unreliable when signals degrade in urban canyons or when sensing is restricted. Language-guided user localization offers a complementary, hardware-agnostic alternative, allowing users to describe their surroundings in natural language under modality-restricted conditions. Such localization is challenging because users typically refer only to salient landmarks or text on signs, leading to a many-to-many correspondence between linguistic descriptions and possible locations in large-scale environments. To address this, we propose UrbanLoc, a two-stage language-guided user localization method via multimodal retrieval over geotagged street-view images. Unlike existing approaches, UrbanLoc leverages complex vision-language cues (e.g., storefront signage) that users typically describe in practical large-scale environments. To this end, we introduce location-aware vision-language alignment using area summaries, along with multimodal large language model-based reranking using object coverage. To validate UrbanLoc, we construct the UrbanLoc-Text benchmark consisting of 34k descriptions referring to 15k locations within 200 km² regions. Experimental results show that UrbanLoc outperforms all baseline methods on L@50, R@1, and R@5 on both the New York and Tokyo subsets.

Method

UrbanLoc consists of three modules: Location-Aware Text Representation Module (LTRM), Multi-Scale Visual Representation Module (MVRM), and Object Coverage Reranking Module (OCRM).

LTRM maps an input description to a location-aware text representation. It applies location-aligned, unimodal, and multimodal text encoders to the description and concatenates the resulting features, which are then processed by a deep state space model (SSM). An auxiliary location head estimates a geographic coordinate and is supervised by a great-circle distance regression loss, encouraging the representation to reflect geographic proximity.
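The great-circle supervision described above can be illustrated with a minimal sketch. It assumes the haversine formula and a mean-distance loss over (latitude, longitude) pairs in degrees; the function names, the Earth-radius constant, and the choice of haversine are illustrative assumptions, not details confirmed by the source.

```python
import math

# Mean Earth radius in kilometres (an assumed constant for this sketch).
EARTH_RADIUS_KM = 6371.0088


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def location_loss(pred, target):
    """Hypothetical regression loss: mean great-circle distance over a batch.

    pred and target are equal-length sequences of (lat, lon) pairs in degrees.
    """
    dists = [great_circle_km(p[0], p[1], t[0], t[1])
             for p, t in zip(pred, target)]
    return sum(dists) / len(dists)
```

In a trained model this distance would be computed with differentiable tensor operations; the plain-Python version only shows the quantity being minimized.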

MVRM captures both visual and location-related cues from each geotagged image through three submodules: (i) a CoT Narrative Representation that generates a Chain-of-Thought explanation of the scene to capture fine-grained object attributes and spatial relationships; (ii) an Area Summary that aggregates representative location descriptions within a local area to capture shared area-level characteristics; and (iii) Scene Text Normalization that resolves multilingual mismatches between user descriptions and OCR-detected text in images.

OCRM reranks candidate locations by computing an object coverage rate — the fraction of objects mentioned in the user's description that are matched in the candidate image's CoT narrative explanation — and adding a weighted coverage term to the retrieval score.
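The reranking rule described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: object matching is approximated by case-insensitive substring search against the narrative, whereas the source describes multimodal large language model-based reranking, and the `weight` parameter and all function names are hypothetical.

```python
def coverage_rate(described_objects, narrative):
    """Fraction of described objects found in the candidate's narrative.

    Matching is a case-insensitive substring test (an assumption made
    for illustration only).
    """
    if not described_objects:
        return 0.0
    text = narrative.lower()
    hits = sum(1 for obj in described_objects if obj.lower() in text)
    return hits / len(described_objects)


def rerank(candidates, described_objects, weight=0.5):
    """Add a weighted coverage term to each retrieval score and re-sort.

    candidates: list of (location_id, retrieval_score, narrative) tuples.
    Returns (location_id, final_score) pairs, best first.
    """
    scored = [
        (loc, score + weight * coverage_rate(described_objects, narrative))
        for loc, score, narrative in candidates
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

With this rule, a candidate whose narrative covers every described object can overtake one with a slightly higher retrieval score but only partial coverage.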

UrbanLoc architecture showing LTRM, MVRM, and OCRM modules

Architecture of UrbanLoc. The model takes a user description and a set of geotagged street-view images. LTRM encodes the description with location awareness, MVRM encodes each image with multi-scale visual features, and OCRM reranks top candidates based on object coverage to produce the final location estimate.

Key Contributions

  • UrbanLoc: A two-stage language-guided user localization method via multimodal retrieval over geotagged street-view images, leveraging complex vision-language cues (e.g., storefront signage) that users typically describe in practice.
  • Location-aware vision-language alignment: An auxiliary location head with great-circle distance supervision integrated into the text encoder, enabling the model to reflect geographic proximity in its representations.
  • Area summary: A submodule that captures area-level characteristics shared across nearby locations, mitigating the partial and viewpoint-dependent nature of individual street-view observations.
  • Object Coverage Reranking Module (OCRM): A reranking module that promotes candidate locations matching all objects specified in the user's description, addressing the limitation of global cosine similarity scores that may over-rank partially matching candidates.
  • UrbanLoc-Text benchmark: A new benchmark consisting of 34k descriptions referring to 15k locations across 200 km² regions in two cities (New York and Tokyo).

Quantitative Results

Comparison between UrbanLoc and baseline methods on the New York and Tokyo subsets of the UrbanLoc-Text benchmark. The mean and standard deviation over five trials are reported for fine-tuned models. Best scores are in bold, second-best are underlined.

Method | FT | New York L@50 [%] ↑ | New York R@1 [%] ↑ | New York R@5 [%] ↑ | Tokyo L@50 [%] ↑ | Tokyo R@1 [%] ↑ | Tokyo R@5 [%] ↑
CLIP | ❄ | 10.5 | 10.1 | 19.1 | 11.1 | 10.8 | 22.0
CLIP | 🔥 | 18.9 | 18.3 | 31.8 | 19.3 | 18.8 | 34.2
BEiT-3 | ❄ | 14.6 | 14.1 | 29.0 | 14.9 | 14.5 | 31.9
BEiT-3 | 🔥 | 17.5 | 16.6 | 31.6 | 17.8 | 17.3 | 34.7
BLIP-2 | ❄ | 6.4 | 6.2 | 14.1 | 7.7 | 7.2 | 16.8
BLIP-2 | 🔥 | 17.9 | 17.7 | 34.2 | 19.8 | 19.4 | 37.6
SigLIP | ❄ | 16.1 | 15.5 | 28.3 | 16.4 | 16.2 | 30.6
SigLIP | 🔥 | 21.5 | 20.7 | 35.3 | 18.3 | 17.8 | 34.5
Long-CLIP | ❄ | 11.4 | 10.8 | 21.7 | 13.1 | 12.7 | 26.2
Long-CLIP | 🔥 | 17.2 | 16.6 | 30.9 | 18.4 | 17.8 | 34.5
CrossText2Loc | ❄ | 1.2 | 1.0 | 3.1 | 1.1 | 0.9 | 2.6
CrossText2Loc | 🔥 | 2.5 | 2.2 | 5.8 | 3.2 | 2.9 | 4.5
SigLIP 2 | ❄ | 15.1 | 14.9 | 28.4 | 14.5 | 14.2 | 29.6
SigLIP 2 | 🔥 | 19.7 | 18.9 | 34.3 | 22.7 | 22.1 | 40.7
Meta CLIP 2 | ❄ | 13.5 | 13.0 | 23.8 | 15.1 | 14.8 | 27.8
Meta CLIP 2 | 🔥 | 15.6 | 15.2 | 27.8 | 18.3 | 17.8 | 32.8
Qwen3-VL-Embedding | ❄ | 17.3 | 16.7 | 30.0 | 18.8 | 18.3 | 31.8
Qwen3-VL-Embedding | 🔥 | 17.9 | 17.2 | 31.2 | 19.2 | 18.8 | 32.9
Qwen3-VL-Embedding + Qwen3-VL-Reranker | 🔥 | 26.8 | 25.9 | 36.3 | 26.6 | 26.0 | 36.5
UrbanLoc (Ours) | 🔥 | 29.6 | 28.3 | 44.3 | 31.8 | 31.1 | 49.1

❄ = zero-shot, 🔥 = fine-tuned. UrbanLoc achieved L@50 of 29.6% (New York) and 31.8% (Tokyo), outperforming all baselines on all metrics.
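For readers unfamiliar with the retrieval metric, the following is a minimal sketch of how R@K could be computed, assuming it denotes the standard Recall@K: the percentage of descriptions whose ground-truth location appears among the top K retrieved candidates. The function name and input layout are illustrative and are not taken from the UrbanLoc evaluation code.

```python
def recall_at_k(ranked_ids, target_ids, k):
    """Recall@K in percent.

    ranked_ids: one ranked list of candidate location IDs per description.
    target_ids: the ground-truth location ID for each description.
    """
    hits = sum(
        1 for ranked, target in zip(ranked_ids, target_ids)
        if target in ranked[:k]
    )
    return 100.0 * hits / len(target_ids)
```

For three descriptions whose ground-truth location is ranked first, second, and third respectively, this gives roughly 33.3% at K=1 and 100% at K=3.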

Qualitative Results

Success and failure cases of UrbanLoc compared with the strongest baseline (Qwen3-VL-Embedding + Qwen3-VL-Reranker).

Success Cases

Qualitative success examples of UrbanLoc on New York and Tokyo subsets

Qualitative results of UrbanLoc and a baseline method (Qwen3-VL-Embedding + Qwen3-VL-Reranker). Green frames indicate the ground-truth location images. (i) New York: "I am on a street with many brick buildings that have external stairs, and I can see a white sign with a black heart on the external wall to my front left." UrbanLoc correctly ranked the target at rank 1; the baseline ranked it at rank 104 (1.6 km error). (ii) Tokyo: "There is a parking lot on the left with a yellow sign saying max 3,600 yen, and a red sign for a fire hydrant on the right." UrbanLoc ranked the target at rank 1 by leveraging the "3,600 yen" text on the sign; the baseline ranked it at rank 85 (11.9 km error).

Failure Case

Qualitative failure example of UrbanLoc

Representative failure case. Description — "I am beside a brown building, and across from it I can see a building that is white on the first floor and beige on the second floor and above." UrbanLoc ranked the target at rank 240 (56.7 km error); the baseline ranked it at rank 518 (58.0 km error). This failure stems from difficulty in capturing fine-grained combinations of visual attributes — each of the top-3 predicted locations contained some of the color cues (brown, white, or beige), but none matched the specific combination and arrangement of all three.

Ablation Study

Each component of UrbanLoc contributes to overall performance. LH = Location Head, CoT NR = CoT Narrative Representation, AS = Area Summary, STN = Scene Text Normalization, OCRM = Object Coverage Reranking Module.

Model | New York L@50 [%] ↑ | New York R@1 [%] ↑ | New York R@5 [%] ↑ | Tokyo L@50 [%] ↑ | Tokyo R@1 [%] ↑ | Tokyo R@5 [%] ↑
(i) UrbanLoc (full) | 29.6 | 28.3 | 44.3 | 31.8 | 31.1 | 49.1
(ii) w/o LH | 27.9 | 26.7 | 41.7 | 30.1 | 29.4 | 45.5
(iii) w/o CoT NR | 27.9 | 26.7 | 41.3 | 28.7 | 28.0 | 43.3
(iv) w/o AS | 28.0 | 26.8 | 41.6 | 29.4 | 28.7 | 44.6
(v) w/o STN | 27.9 | 26.9 | 42.1 | 29.8 | 29.2 | 45.1
(vi) w/o OCRM | 23.9 | 22.9 | 39.5 | 25.0 | 24.3 | 43.1

Removing OCRM causes the largest performance drop (−5.7 points on New York and −6.8 points on Tokyo in L@50), highlighting the importance of object-level agreement. Removing any one of the other components lowers L@50 by 1.6 to 1.7 points on New York and by 1.7 to 3.1 points on Tokyo.
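The L@50 drops can be recomputed directly from the ablation table. The short sketch below uses only the reported L@50 values; the variable and function names are illustrative.

```python
# L@50 [%] values copied from the ablation table.
FULL = {"New York": 29.6, "Tokyo": 31.8}
ABLATIONS = {
    "w/o LH": {"New York": 27.9, "Tokyo": 30.1},
    "w/o CoT NR": {"New York": 27.9, "Tokyo": 28.7},
    "w/o AS": {"New York": 28.0, "Tokyo": 29.4},
    "w/o STN": {"New York": 27.9, "Tokyo": 29.8},
    "w/o OCRM": {"New York": 23.9, "Tokyo": 25.0},
}


def l50_drops(full, ablations):
    """Points lost in L@50 relative to the full model, per ablation and city."""
    return {
        name: {city: round(full[city] - value, 1)
               for city, value in values.items()}
        for name, values in ablations.items()
    }


drops = l50_drops(FULL, ABLATIONS)
# The ablation with the largest drop on each subset.
largest = {city: max(drops, key=lambda name: drops[name][city]) for city in FULL}
```

Here `largest` maps both subsets to "w/o OCRM", matching the observation that OCRM is the single most influential component.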

UrbanLoc-Text Benchmark

We constructed the UrbanLoc-Text benchmark for language-guided user localization using geotagged street-view images collected via Mapillary. Test-set descriptions were collected from 151 human annotators, while training and validation descriptions were generated with Qwen3-VL from street-view images and OCR-detected text. The benchmark is geographically split so that every pair of locations from different splits is separated by at least 20 m.
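The 20 m separation constraint can be verified with a brute-force check such as the sketch below. It assumes locations are (latitude, longitude) pairs in degrees and uses the haversine formula; the function names and the quadratic pairwise scan are illustrative assumptions, not the benchmark's actual tooling.

```python
import math
from itertools import product

# Mean Earth radius in metres (an assumed constant for this sketch).
EARTH_RADIUS_M = 6_371_008.8


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(p[0]), math.radians(q[0])
    dphi = phi2 - phi1
    dlmb = math.radians(q[1] - p[1])
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(min(1.0, a)))


def splits_are_separated(split_a, split_b, min_distance_m=20.0):
    """True if every cross-split pair of locations is at least min_distance_m apart."""
    return all(
        haversine_m(p, q) >= min_distance_m
        for p, q in product(split_a, split_b)
    )
```

For splits with thousands of locations, a spatial index would be preferable to the pairwise scan, but the condition being tested is the same.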

Statistic | Train / Val, New York | Train / Val, Tokyo | Test, New York | Test, Tokyo
Descriptions | 12,298 | 14,790 | 3,235 | 3,673
Locations | 6,149 | 7,395 | 647 | 763
Images | 6,149 | 7,395 | 647 | 763
Vocabulary size | 10,254 | 11,323 | 3,874 | 4,649
Word count | 515,234 | 637,535 | 88,969 | 101,514
Avg. description length (words) | 41.90 | 43.11 | 27.50 | 27.64
Annotators (test set): 151

Error Analysis

We analyzed 100 failure cases (50 per subset) with the largest great-circle distance errors. Five major failure modes were identified.

  • Compositionality error (43%): Fails to capture compositional relationships among multiple concepts, including object co-occurrence and attribute–object binding.
  • Spatial comprehension error (20%): Fails to correctly understand spatial relationships among multiple objects, such as relative positions or ordering.
  • Scene text comprehension error (18%): Fails to properly interpret or utilize scene text information (e.g., signage text) in the candidate image.
  • Ambiguous description (12%): Multiple candidate locations could match the given description, making it unclear which location is intended.
  • OCR error (7%): The OCR system fails to detect scene text in the candidate image, preventing text-based cues from being leveraged.

Compositionality errors are the primary bottleneck (43%). A potential direction is to incorporate an $\ell_1$-product metric over separate embedding factors, which captures cross-family composition in a Boolean-algebra-like manner.
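For reference, the standard definition of an $\ell_1$-product metric is given below. The notation is an assumption for illustration: an embedding is split into $K$ factors $x^{(1)}, \dots, x^{(K)}$, each with its own metric $d_k$; the source does not specify the factors or the per-factor metrics.

```latex
\[
d_{\ell_1}(x, y) \;=\; \sum_{k=1}^{K} d_k\bigl(x^{(k)}, y^{(k)}\bigr),
\qquad
x = \bigl(x^{(1)}, \dots, x^{(K)}\bigr),\quad
y = \bigl(y^{(1)}, \dots, y^{(K)}\bigr).
\]
```

Because the per-factor distances add, a candidate that matches only some factors (for example, color but not arrangement) remains farther from the query than one that matches all of them.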
