Abstract:
Objectives Urban imagery, including high-resolution remote sensing imagery and street view imagery, provides a detailed representation of the physical environment of cities and enables multi-scale analysis, from global perspectives down to fine-grained street-level detail. Extracting high-level semantic features from the vast and complex pixel data of urban imagery for pattern recognition and decision-making support has long been a central concern in urban studies.
Methods Traditional approaches, which progress from handcrafted shallow features such as color histograms to representations of semantic elements, are often insufficient: they are constrained by expert knowledge and fail to capture the deep attributes and complex spatial relationships inherent in urban environments. In response, computational representation methods grounded in representation learning can extract high-dimensional deep features from urban imagery. These features not only capture richer urban semantic and structural information but also facilitate more effective multimodal data integration by reducing semantic conflicts, thereby enabling more accurate and robust urban models. Unlike handcrafted features, these computational representations are inherently data-driven, learning directly from the intrinsic structure and patterns of the raw data. Notably, intelligent computational representations based on self-supervised learning (SSL) stand out as a critical solution to the data-labeling bottleneck. Given the massive scale of urban imagery and the prohibitively high cost of manual annotation, which often suffers from inconsistent quality, SSL advances the automation of urban imagery analysis by learning powerful representations from the data's own intrinsic properties, without requiring labeled datasets. We argue that the most effective SSL representations must be task-centric, moving beyond generic augmentations that are ill-suited to complex urban tasks. This is achieved by designing specialized pretext tasks that autonomously encode key information. Such tasks exploit the unique attributes and structure of urban imagery to learn systematically distinct representations: for example, isolating time-invariant features by ignoring dynamic elements for location recognition, versus capturing ambient socio-economic and cultural semantics for regional-level prediction; a minimal sketch of such a pretext task follows.
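To make the idea of a task-centric pretext task concrete, the sketch below is our illustration rather than the paper's method; it assumes PyTorch and torchvision, a ResNet-18 backbone, and an InfoNCE-style contrastive loss. Two images of the same location captured at different times are treated as a positive pair, so the encoder is encouraged to retain time-invariant scene structure while discarding dynamic elements such as pedestrians, vehicles, and lighting.

```python
# Illustrative sketch of a task-centric SSL pretext task for location recognition.
# Assumptions: PyTorch + torchvision; names such as LocationEncoder are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models


class LocationEncoder(nn.Module):
    """ResNet backbone followed by a projection head, as in common SSL setups."""

    def __init__(self, dim: int = 128):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()          # keep the 512-d pooled features
        self.backbone = backbone
        self.projector = nn.Sequential(
            nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.projector(self.backbone(x)), dim=1)


def info_nce_loss(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Contrastive loss: image i at time A should match image i at time B,
    while the other locations in the batch act as negatives."""
    logits = z_a @ z_b.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    encoder = LocationEncoder()
    # Hypothetical batch: 8 locations, each observed at two different times.
    imgs_t1 = torch.randn(8, 3, 224, 224)
    imgs_t2 = torch.randn(8, 3, 224, 224)
    loss = info_nce_loss(encoder(imgs_t1), encoder(imgs_t2))
    loss.backward()
    print(f"pretext loss: {loss.item():.4f}")
```

A regional-level prediction task could reuse the same machinery but define positives by spatial proximity or administrative unit, steering the learned representation toward ambient socio-economic and cultural semantics instead of time-invariant scene structure.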
Results We explore the characteristics, evolutionary trajectory, and interpretability of these intelligent computational representations. Because deep features often lack explicit semantic meaning, we emphasize interpretability analysis as essential both for validating the effectiveness of the encoding process and for understanding what information has been encoded. We discuss a two-fold approach: (1) revealing the semantic content of features using methods such as SemAxis, or visualizing clustering and separation patterns in low-dimensional space via t-SNE and PaCMAP; and (2) identifying the model's focus using visualization techniques such as Grad-CAM and attention maps to understand which image regions drive a decision. A brief illustration of the first strand appears below.
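As an illustration of the first strand of interpretability analysis, the following sketch (assuming scikit-learn and matplotlib; the embeddings and land-use labels are placeholders, not real data) projects learned embeddings into two dimensions with t-SNE to check whether they separate by a known urban attribute.

```python
# Illustrative sketch: inspect clustering of learned urban-image embeddings.
# Assumptions: scikit-learn and matplotlib; embeddings/labels are synthetic placeholders.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 128))        # placeholder for encoder outputs
land_use = rng.integers(0, 4, size=500)         # placeholder categorical labels

# Perplexity controls the neighborhood size t-SNE tries to preserve.
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)

plt.figure(figsize=(5, 5))
scatter = plt.scatter(coords[:, 0], coords[:, 1], c=land_use, cmap="tab10", s=8)
plt.legend(*scatter.legend_elements(), title="land use (hypothetical)")
plt.title("t-SNE of learned urban image embeddings")
plt.tight_layout()
plt.show()
```

If the embeddings separate cleanly by the chosen attribute, that is evidence the pretext task encoded the corresponding semantics; Grad-CAM and attention maps complement this by showing which image regions contribute to individual predictions.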
Conclusions Consequently, these advances offer more precise support for urban research, planning, and sustainable development. We also highlight future opportunities and challenges, including: (1) developing robust multimodal alignment to bridge the gap between areal remote sensing features and point-based street view representations; (2) integrating visual features with large language models (LLMs) to summarize complex urban concepts, facts, and opinions; and (3) addressing the fundamental "black box" nature of deep features to advance AI for Science.