Abstract:
Urban imagery provides a detailed representation of the physical environment of cities, enabling multi-scale analysis that ranges from global perspectives to micro-level details. Extracting high-level semantic features from the vast, complex pixel data of urban imagery through efficient feature engineering, for use in pattern recognition and decision support, has long been a central concern in urban studies. In contrast to traditional approaches that rely on manually defined semantic elements, computational representation methods grounded in representation learning can extract high-dimensional deep features from urban imagery. These features not only capture richer urban semantic and structural information but also facilitate multi-modal data integration and the construction of more accurate and robust urban models. Intelligent computational representations based on self-supervised learning are especially notable: they can autonomously encode task-relevant information without labeled data, thereby advancing the automation of urban imagery analysis. This paper examines the characteristics, evolutionary trajectory, and interpretability of intelligent computational representations of urban imagery, highlighting their potential to substantially strengthen intelligent urban analysis and, in turn, to provide more precise and reliable support for urban research, planning, management, and sustainable development.