Objectives To address the inconsistent data distributions across cross-modal remote sensing information caused by the "heterogeneity gap", a new cross-modal remote sensing dataset is constructed and publicly released.
Methods To bridge the heterogeneity gap, a general cross-modal correlation learning method (CCLM) is proposed for remote sensing. Exploiting the latent semantic consistency between different modalities, CCLM comprises two stages: feature representation learning and common feature space construction. First, deep neural networks are adopted to extract feature representations of image and sequence information. To construct the common feature space, a new loss function is designed for correlation learning by exploiting intra-modal semantic consistency and inter-modal complementary information; a sketch of such a loss is given below. Second, knowledge distillation is used to enhance semantic relevance and thereby achieve semantic consistency in the common space.
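To make the correlation-learning idea concrete, the following is a minimal sketch of a loss that combines an inter-modal ranking term with an intra-modal consistency term. It assumes a PyTorch setup; the margin, the weighting factor alpha, and the function name are illustrative placeholders, not details taken from the paper.

```python
# Hypothetical sketch of a correlation-learning loss (not the paper's exact formulation).
import torch
import torch.nn.functional as F

def correlation_loss(img_feat, seq_feat, labels, margin=0.2, alpha=0.5):
    """Combine inter-modal ranking with intra-modal semantic consistency."""
    # L2-normalize features so cosine similarity reduces to a dot product.
    img = F.normalize(img_feat, dim=1)
    seq = F.normalize(seq_feat, dim=1)

    # Inter-modal term: matched image-sequence pairs should score higher
    # than mismatched pairs by at least `margin` (triplet-style ranking).
    sim = img @ seq.t()                      # pairwise cross-modal similarities
    pos = sim.diag().unsqueeze(1)            # similarities of matched pairs
    inter = (margin + sim - pos).clamp(min=0).fill_diagonal_(0).mean()

    # Intra-modal term: samples sharing a semantic label should stay close
    # within each modality (semantic consistency).
    same = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    intra = ((1 - img @ img.t()) * same).mean() + ((1 - seq @ seq.t()) * same).mean()

    return inter + alpha * intra
```

In this sketch, the inter-modal term pulls matched cross-modal pairs together while pushing mismatched pairs apart, and the intra-modal term keeps same-class samples compact within each modality.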
Results Experiments are conducted on the proposed dataset. The results show that the mean average precision (mAP) of CCLM on cross-modal retrieval tasks exceeds 70%.
Conclusions CCLM outperforms the baseline methods, verifying the effectiveness of the proposed dataset and method.