ZHANG Liangpei, ZHANG Lefei, YUAN Qiangqiang. Large Remote Sensing Model: Progress and Prospects[J]. Geomatics and Information Science of Wuhan University, 2023, 48(10): 1574-1581. DOI: 10.13203/j.whugis20230341

Large Remote Sensing Model: Progress and Prospects

More Information
  • Received Date: September 16, 2023
  • Available Online: September 24, 2023
  • Abstract: In recent years, significant advances in large language models and visual foundation models have drawn scholars' attention to the potential of general artificial intelligence technology in remote sensing, propelling a new research paradigm of large models for remote sensing information processing. Large remote sensing models, also known as pre-trained remote sensing foundation models, are a class of methods that train large-scale deep learning models on vast amounts of unlabeled remote sensing imagery. The goal is to extract universal feature representations from remote sensing images, thereby improving the performance, efficiency, and versatility of remote sensing image analysis tasks. Research on large remote sensing models involves three key factors: pre-training datasets, model parameters, and pre-training techniques. The first two can be flexibly scaled as data and computational resources grow, while pre-training techniques are critical to improving model performance. This review therefore focuses on the pre-training techniques of large remote sensing models and systematically analyzes existing supervised single-modal, unsupervised single-modal, and vision-text joint multimodal pre-trained large remote sensing models. The conclusion offers prospects for large remote sensing models in integrating domain knowledge and physical constraints, enhancing data generalization, expanding application scenarios, and reducing data costs.
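  The vision-text joint multimodal pre-training mentioned above is typically driven by a symmetric contrastive (InfoNCE) objective over paired image and caption embeddings, as popularized by CLIP-style models. Below is a minimal NumPy sketch of that objective for intuition only; the function name `clip_style_loss`, the temperature value, and the toy embeddings are illustrative assumptions, not code from any of the surveyed models.

```python
import numpy as np

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    Illustrative sketch of a CLIP-style objective: each image's positive is
    its own caption (the diagonal of the similarity matrix); every other
    caption in the batch is a negative.
    """
    # L2-normalize so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (B, B) similarity matrix
    labels = np.arange(len(img_emb))          # matched pairs on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the image->text and text->image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

  With correctly paired embeddings the loss is near zero, while shuffling the captions against the images drives it up — which is exactly the signal that lets such models pre-train on image-text pairs without task-specific labels.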

  • [1]
    Zhang L P, Zhang L F, Du B. Deep Learning for Remote Sensing Data: A Technical Tutorial on the State of the Art[J]. IEEE Geoscience and Remote Sensing Magazine, 2016, 4(2): 22-40. doi: 10.1109/MGRS.2016.2540798
    [2]
    Chen Y S, Lin Z H, Zhao X, et al. Deep Learning-based Classification of Hyperspectral Data[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2014, 7(6): 2094-2107. doi: 10.1109/JSTARS.2014.2329330
    [3]
    Radford A, Kim J W, Hallacy C, et al. Learning Transferable Visual Models from Natural Language Supervision[C]//International Conference on Machine Learning, Salt Lake City, USA, 2021.
    [4]
    Yuan L, Chen D D, Chen Y L, et al. Florence: A New Foundation Model for Computer Vision[EB/OL]. (2021-09-25)[2023-05-23]. https://arxiv.org/abs/2111.11432.
    [5]
    Bao H, Dong L, Piao S, et al. BEiT: BERT Pre-Training of Image Transformers[C]//The Tenth International Conference on Learning Representations, Vienna, Austria, 2022.
    [6]
    Brown T B, Mann B, Ryder N, et al. Language Models are Few-shot Learners[C]//The 34th International Conference on Neural Information Processing Systems, Vancouver, Canada, 2020.
    [7]
    Zhang S S, Roller S, Goyal N, et al. OPT: Open Pre-trained Transformer Language Models[J]. CoRR, 2022, abs/2205.01068.
    [8]
    Raffel C, Shazeer N, Roberts A, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer[J]. Journal of Machine Learning Research, 2020, 21(140): 1-67.
    [9]
    Wang Y, Albrecht C M, Zhu X X. Self-supervised Vision Transformers for Joint SAR-optical Representation Learning[C]//IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 2022.
    [10]
    Reed C J, Gupta R, Li S F, et al. Scale-MAE: A Scale-aware Masked Autoencoder for Multiscale Geospatial Representation Learning[EB/OL]. (2022-12-30)[2023-05-23]. https://arxiv.org/abs/2212.14532.
    [11]
    Chen Z L, Wang Y Y, Han W, et al. An Improved Pre-training Strategy-based Scene Classification with Deep Learning[J]. IEEE Geoscience and Remote Sensing Letters, 2020, 17(5): 844-848. doi: 10.1109/LGRS.2019.2934341
    [12]
    Risojević V, Stojnić V. Do We Still Need ImageNet Pre-training in Remote Sensing Scene Classification?[EB/OL]. (2021-11-05)[2023-05-23]. https://arxiv.org/abs/2111.03690.
    [13]
    Muhtar D, Zhang X L, Xiao P F. Index Your Position: A Novel Self-Supervised Learning Method for Remote Sensing Images Semantic Segmentation[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 1-11.
    [14]
    Wang D, Zhang J, Du B, et al. An Empirical Study of Remote Sensing Pre-training[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 1-20.
    [15]
    Li H F, Li Y, Zhang G, et al. Global and Local Contrastive Self-Supervised Learning for Semantic Segmentation of HR Remote Sensing Images[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 1-11.
    [16]
    Wang D, Zhang Q M, Xu Y F, et al. Advancing Plain Vision Transformer Toward Remote Sensing Foundation Model[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 1-15.
    [17]
    Liu F, Chen D L, Guan Z, et al. RemoteCLIP: A Vision Language Foundation Model for Remote Sensing[EB/OL]. (2023-06-19)[2023-09-15]. https://arxiv.org/abs/2306.11029.
    [18]
    Hu Y, Yuan J L, Wen C C, et al. RSGPT: A Remote Sensing Vision Language Model and Benchmark[EB/OL]. (2023-07-28)[2023-09-10]. https://arxiv.org/abs/2307.15266.
    [19]
    Fuller A, Millard K, Green J R. Transfer Learning with Pretrained Remote Sensing Transformers[EB/OL]. (2022-09-28)[2023-08-20]. https://arxiv.org/abs/2209.14969.
    [20]
    Noman M, Fiaz M, Cholakkal H, et al. Remote Sensing Change Detection with Transformers Trained from Scratch[EB/OL]. (2023-03-13)[2023-08-20]. https://arxiv.org/abs/2304.06710.
    [21]
    Aharon M, Elad M, Bruckstein AM. K -SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation[J]. IEEE Transactions on Signal Processing, 2006, 54(11): 4311-4322. doi: 10.1109/TSP.2006.881199
    [22]
    Hinton G E, Salakhutdinov R R. Reducing the Dimensionality of Data with Neural Networks[J]. Science, 2006, 313(5786): 504-507. doi: 10.1126/science.1127647
    [23]
    Stojnić V, Risojević V. Self-supervised Learning of Remote Sensing Scene Representations Using Contrastive Multiview Coding[EB/OL]. (2023-03-14)[2023-05-23]. https://arxiv.org/abs/2104.07070.
    [24]
    Ayush K, Uzkent B, Meng C L, et al. Geography-Aware Self-supervised Learning[EB/OL]. (2020-11-19)[2023-05-23]. https://arxiv.org/abs/2011.09980.
    [25]
    Mañas O, Lacoste A, Giro-i-Nieto X, et al. Seasonal Contrast: Unsupervised Pre-training from Uncurated Remote Sensing Data[EB/OL]. (2021-04-30)[2023-05-23]. https://arxiv.org/abs/2103.16607.
    [26]
    He K M, Chen X L, Xie S N, et al. Masked Autoencoders are Scalable Vision Learners[EB/OL]. (2022-12-30)[2023-05-23]. https://arxiv.org/abs/2111.06377.
    [27]
    Muhtar D, Zhang X L, Xiao P F, et al. CMID: A Unified Self-Supervised Learning Framework for Remote Sensing Image Understanding[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 1-17.
    [28]
    Zhang Y X, Zhao Y, Dong Y N, et al. Self-Supervised Pre-training via Multimodality Images with Transformer for Change Detection[J]. IEEE Tran-sactions on Geoscience and Remote Sensing, 1024, 61: 1-11.
    [29]
    Li Y Y, Alkhalifah T, Huang J P, et al. Self-supervised Pre-training Vision Transformer with Masked Autoencoders for Building Subsurface Model[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, DOI: 10.1109/TGRS.2023.3308999.
    [30]
    Zhang T, Zhuang Y, Chen H, et al. Object-centric Masked Image Modeling-based Self-supervised Pre-training for Remote Sensing Object Detection[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2023, 16: 5013-5025. doi: 10.1109/JSTARS.2023.3277588
    [31]
    Sun X, Wang P J, Lu W X, et al. RingMo: A Remote Sensing Foundation Model with Masked Image Modeling[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 61: 1-22.
    [32]
    Cong Y Z, Khanna S, Meng C L, et al. SatMAE: Pre-training Transformers for Temporal and Multi-spectral Satellite Imagery[EB/OL]. (2022-12-30)[2023-05-23]. https://arxiv.org/abs/2207.08051.
    [33]
    Tseng G, Zvonkov I, Purohit M, et al. Lightweight, Pre-trained Transformers for Remote Sensing Timeseries[EB/OL]. (2022-12-30)[2023-05-23]. https://arxiv.org/abs/2304.14065.
    [34]
    Zhang T, Gao P, Dong H, et al. Consecutive Pre-Training: A Knowledge Transfer Learning Strategy with Relevant Unlabeled Data for Remote Sensing Domain[J]. Remote Sensing, 2022, 14(22): 5675. doi: 10.3390/rs14225675
    [35]
    Cha K, Seo J, Lee T. A Billion-scale Foundation Model for Remote Sensing Images[EB/OL]. (2022-12-30)[2023-05-23]. https://arxiv.org/abs/2304.05215.
    [36]
    Bazi Y, Al Rahhal M M, Mekhalfi M L, et al. Bi-modal Transformer-based Approach for Visual Question Answering in Remote Sensing Imagery[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 1-11.
    [37]
    Mikriukov G, Ravanbakhsh M, Demir B. Deep Unsupervised Contrastive Hashing for Large-scale Cross-modal Text-image Retrieval in Remote Sensing[EB/OL]. (2022-12-30)[2023-05-23]. https://arxiv.org/abs/2201.08125.
    [38]
    Chappuis C, Zermatten V, Lobry S, et al. Prompt–RSVQA: Prompting Visual Context to a Language Model for Remote Sensing Visual Question Answering[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), New Orleans, USA, 2022.
    [39]
    Liu C Y, Zhao R, Chen J Q, et al. A Decoupling Paradigm with Prompt Learning for Remote Sensing Image Change Captioning [J]. Geoscience, 2023, DOI: 10.1109/TGRS.2022.3232784.
    [40]
    Li J N, Li D X, Savarese S, et al. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models[EB/OL]. (2022-12-30)[2023-05-23]. https://arxiv.org/abs/2301.12597.
    [41]
    Wei T T, Yuan W L, Luo J R, et al. VLCA: Vision-language Aligning Model with Cross-modal Attention for Bilingual Remote Sensing Image Captioning[J]. Journal of Systems Engineering and Electronics, 2023, 34(1): 9-18. doi: 10.23919/JSEE.2023.000035
    [42]
    Zhang Z L, Zhao T C, Guo Y L, et al. RS5M: A Large Scale Vision-language Dataset for Remote Sensing Vision-language Foundation Model[EB/OL]. (2023-06-20)[2023-08-23]. https://arxiv.org/abs/2306.11300.
    [43]
    Yang Y, Zhuang Y T, Pan Y H. Multiple Know-ledge Representation for Big Data Artificial Intelligence: Framework, Applications, and Case Studies[J]. Frontiers of Information Technology & Electronic Engineering, 2021, 22(12): 1551-1558.
    [44]
    Yang Z X, Yang Y. Decoupling Features in Hierarchical Propagation for Video Object Segmentation[EB/OL]. (2022-10-18)[2023-08-23]. https://arxiv.org/abs/2210.09782.
    [45]
    Zhang X M, Wu C Y, Zhang Y, et al. Knowledge-enhanced Visual-Language Pre-training on Chest Radiology Images[J]. Nature Communications, 2023, 14: 4542. doi: 10.1038/s41467-023-40260-7
    [46]
    Zhou K Y, Yang J K, Loy C C, et al. Learning to Prompt for Vision-Language Models[J]. International Journal of Computer Vision, 2022, 130(9): 2337-2348.
    [47]
    Zhong Y W, Yang J W, Zhang P C, et al. RegionCLIP: Region-based Language-Image Pre-training[EB/OL]. (2021-12-16)[2023-05-23]. https://arxiv.org/abs/2112.09106.
    [48]
    Rao Y M, Zhao W L, Chen G Y, et al. DenseCLIP: Language-guided Dense Prediction with Context-Aware Prompting[EB/OL]. (202-12-02)[2023-05-23]. https://arxiv.org/abs/2112.01518.
    [49]
    Jia M, Tang L, Chen B C, et al. Visual Prompt Tuning[C]//17th European Conference on Computer Vision, Tel Aviv, Israel, 2022.
    [50]
    Houlsby N, Giurgiu A, Jastrzebski S, et al. Parameter-Efficient Transfer Learning for NLP[EB/OL]. (2019-02-02)[2023-05-23]. https://arxiv.org/abs/1902.00751.
    [51]
    Liu Y, Zhu G, Zhu B, et al. TaiSu: A 166M Large-scale High-Quality Dataset for Chinese Vision-Language Pre-training[C]//Annual Conference on Neural Information Processing Systems, New Orleans, USA, 2022.
    [52]
    Rasheed H, Maaz M, Khattak M U, et al. Bridging the Gap Between Object and Image-Level Representations for Open-Vocabulary Detection[EB/OL]. (2022-07-07)[2023-05-23]. https://arxiv.org/abs/2207.03482.
    [53]
    Mal Z, Luo G, Gao J, et al. Open-Vocabulary One-stage Detection with Hierarchical Visual-Language Knowledge Distillation[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, USA, 2022.
    [54]
    Xie J, Zheng S. ZSD-YOLO: Zero-Shot YOLO Detection Using Vision-Language Knowledge Distillation[EB/OL]. (2021-09-24)[2023-05-23]. https://arxiv.org/abs/2109.12066.
    [55]
    Rombach R, Blattmann A, Lorenz D, et al. High-Resolution Image Synthesis with Latent Diffusion Models[C]//IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022.
    [56]
    Wang W L, Bao J M, Zhou W G, et al. Semantic Image Synthesis via Diffusion Models[EB/OL]. (2022-06-30)[2023-05-23]. https://arxiv.org/abs/2207.00050.
    [57]
    Dhariwal P, Nichol A Q. Diffusion Models Beat GANs on Image Synthesis[C]//Annual Conference on Neural Information Processing Systems, Nice, France, 2021.