Mission: Impossible – Image-Based Geolocation with Large Vision Language Models
Authors: Yi Liu (Quantstamp), Gelei Deng (Nanyang Technological University), Junchen Ding (University of New South Wales), Yuekang Li (University of New South Wales), Tianwei Zhang (Nanyang Technological University), Weisong Sun (Nanyang Technological University), Yaowen Zheng (Institute of Information Engineering, Chinese Academy of Sciences), Jingquan Ge (Nanyang Technological University)
Volume: 2025
Issue: 4
Pages: 410–428
DOI: https://doi.org/10.56553/popets-2025-0137
Abstract: In the age of ubiquitous smartphone use and widespread image sharing on social platforms, geolocation poses a critical privacy concern. Images often carry sensitive spatial and temporal details, such as street signs, architectural styles, or landmarks, that can inadvertently disclose the precise whereabouts of individuals and organizations. Recent advances in large vision-language models (LVLMs) present an emerging threat by enabling users, regardless of technical expertise, to extract location cues from seemingly benign photos. While existing AI-driven geolocation solutions often focus on narrow datasets or specialized contexts, the generalizable performance and privacy implications of zero-shot LVLMs in real-world settings remain open questions. In this paper, we investigate the geolocation capabilities of state-of-the-art LVLMs. Our findings reveal that although these models show a non-negligible ability to geolocate images even without specialized training, their absolute accuracy is often low, exposing clear limitations in their current state. We then introduce ETHAN, a framework that integrates chain-of-thought (CoT) reasoning into LVLM-based geolocation. Although ETHAN improves performance (e.g., 28.7% accuracy at the 1 km threshold) and achieves an 85.4% win rate on GeoGuessr, these results primarily highlight the potential trajectory of such technologies rather than their current widespread, high-accuracy applicability. Our study underscores the dual nature of LVLMs in this domain: their inherent, albeit limited, geolocation abilities constitute an emerging privacy risk, yet they remain subject to significant constraints. We conclude by calling for further research into the limitations and risks of LVLM-based geolocation and for the development of effective mitigation strategies to protect sensitive location data.
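The "accuracy at the 1 km threshold" metric cited in the abstract is conventionally computed as the fraction of predicted coordinates that fall within 1 km great-circle distance of the ground truth. Below is a minimal sketch of that computation, assuming the standard haversine formula; the function names and sample coordinates are illustrative and are not taken from the paper.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def accuracy_at_threshold(predictions, ground_truth, threshold_km=1.0):
    """Fraction of predicted coordinates within threshold_km of the truth."""
    hits = sum(
        haversine_km(p[0], p[1], g[0], g[1]) <= threshold_km
        for p, g in zip(predictions, ground_truth)
    )
    return hits / len(ground_truth)

# Toy usage (hypothetical data): one prediction ~0.2 km off, one ~190 km off.
preds = [(48.8600, 2.2950), (50.0, 2.0)]
truth = [(48.8584, 2.2945), (48.3, 2.1)]
print(accuracy_at_threshold(preds, truth, threshold_km=1.0))  # 0.5
```

The same routine generalizes to the coarser thresholds (e.g., 25 km or 200 km) commonly reported alongside the 1 km figure in geolocation benchmarks.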
Keywords: geolocation, vision-language models, multimodal reasoning, location inference, security and privacy
Copyright in PoPETs articles is held by their authors. This article is published under a Creative Commons Attribution 4.0 license.
