Abstract:
Objectives Urban imagery, including high-resolution remote sensing imagery and street view imagery, provides a detailed representation of the physical environment of cities and enables multi-scale analysis, from global perspectives down to fine-grained street-level detail. Extracting high-level semantic features from the vast and complex pixel data of urban imagery for pattern recognition and decision-making support has long been a central concern in urban studies.
Methods Traditional approaches, which progress from handcrafted shallow features such as color histograms to representations of semantic elements, are often insufficient: they are constrained by expert knowledge and fail to capture the deep attributes and complex spatial relationships inherent in urban environments. In response, computational representation methods grounded in representation learning can extract high-dimensional deep features from urban imagery. These features not only capture richer urban semantic and structural information but also facilitate more effective multimodal data integration by reducing semantic conflicts, thereby enabling more accurate and robust urban models. Unlike handcrafted features, these computational representations are inherently data-driven, learning directly from the intrinsic structure and patterns of the raw data. Notably, intelligent computational representations based on self-supervised learning (SSL) stand out as a critical solution to the data-labeling bottleneck. Given the massive scale of urban imagery and the prohibitively high cost of manual annotation, which often suffers from inconsistent quality, SSL advances the automation of urban imagery analysis by learning powerful representations from the data's own intrinsic properties, without requiring labeled datasets. We argue that the most effective SSL representations must be task-centric, moving beyond generic augmentations that are ill-suited to complex urban tasks. This is achieved by designing specialized pretext tasks that autonomously encode key information. Such tasks exploit the unique attributes and structure of urban imagery to learn systematically distinct representations: for example, isolating time-invariant features by ignoring dynamic elements for location recognition, versus capturing ambient socio-economic and cultural semantics for regional-level prediction; a minimal sketch of such a pretext task follows.
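To make the idea of a task-centric pretext task concrete, the sketch below is our illustration rather than the paper's method; it assumes PyTorch and torchvision, a ResNet-18 backbone, and an InfoNCE-style contrastive loss. Two images of the same location captured at different times are treated as a positive pair, so the encoder is encouraged to retain time-invariant scene structure while discarding dynamic elements such as pedestrians, vehicles, and lighting.

```python
# Illustrative sketch of a task-centric SSL pretext task for location recognition.
# Assumptions: PyTorch + torchvision; names such as LocationEncoder are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models


class LocationEncoder(nn.Module):
    """ResNet backbone followed by a projection head, as in common SSL setups."""

    def __init__(self, dim: int = 128):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()          # keep the 512-d pooled features
        self.backbone = backbone
        self.projector = nn.Sequential(
            nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.projector(self.backbone(x)), dim=1)


def info_nce_loss(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Contrastive loss: image i at time A should match image i at time B,
    while the other locations in the batch act as negatives."""
    logits = z_a @ z_b.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    encoder = LocationEncoder()
    # Hypothetical batch: 8 locations, each observed at two different times.
    imgs_t1 = torch.randn(8, 3, 224, 224)
    imgs_t2 = torch.randn(8, 3, 224, 224)
    loss = info_nce_loss(encoder(imgs_t1), encoder(imgs_t2))
    loss.backward()
    print(f"pretext loss: {loss.item():.4f}")
```

A regional-level prediction task could reuse the same machinery but define positives by spatial proximity or administrative unit, steering the learned representation toward ambient socio-economic and cultural semantics instead of time-invariant scene structure.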
Results We explore the characteristics, evolutionary trajectory, and interpretability of these intelligent computational representations. Because deep features often lack explicit semantic meaning, we emphasize interpretability analysis as essential both for validating the effectiveness of the encoding process and for understanding what information has been encoded. We discuss a two-fold approach: (1) revealing the semantic content of features using methods such as SemAxis, or visualizing clustering and separation patterns in low-dimensional space via t-SNE and PaCMAP; and (2) identifying the model's focus using visualization techniques such as Grad-CAM and attention maps to understand which image regions drive a decision. A brief illustration of the first strand appears below.
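As an illustration of the first strand of interpretability analysis, the following sketch (assuming scikit-learn and matplotlib; the embeddings and land-use labels are placeholders, not real data) projects learned embeddings into two dimensions with t-SNE to check whether they separate by a known urban attribute.

```python
# Illustrative sketch: inspect clustering of learned urban-image embeddings.
# Assumptions: scikit-learn and matplotlib; embeddings/labels are synthetic placeholders.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 128))        # placeholder for encoder outputs
land_use = rng.integers(0, 4, size=500)         # placeholder categorical labels

# Perplexity controls the neighborhood size t-SNE tries to preserve.
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)

plt.figure(figsize=(5, 5))
scatter = plt.scatter(coords[:, 0], coords[:, 1], c=land_use, cmap="tab10", s=8)
plt.legend(*scatter.legend_elements(), title="land use (hypothetical)")
plt.title("t-SNE of learned urban image embeddings")
plt.tight_layout()
plt.show()
```

If the embeddings separate cleanly by the chosen attribute, that is evidence the pretext task encoded the corresponding semantics; Grad-CAM and attention maps complement this by showing which image regions contribute to individual predictions.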
Conclusions Consequently, these advances offer more precise support for urban research, planning, and sustainable development. We also highlight future opportunities and challenges, including: (1) developing robust multimodal alignment to bridge the gap between areal remote sensing features and point-based street view representations; (2) integrating visual features with large language models (LLMs) to summarize complex urban concepts, facts, and opinions; and (3) addressing the fundamental "black box" nature of deep features to advance AI for Science.