Abstract:
Objectives Cloud cover hampers optical remote sensing, especially in high-altitude alpine regions. On the Southeastern Tibetan Plateau, cloud cover reaches 62%, severely limiting the use of optical imagery. While single-image cloud removal methods are faster than multi-temporal methods, they face technical hurdles: existing methods struggle under thick clouds, and generative adversarial network (GAN)-based approaches often produce artifacts and offer poor interpretability. We propose a weighted-masking cloud removal model designed for complex alpine climates, aiming for accurate surface restoration under various cloud conditions while reducing artifacts and improving robustness in mountainous, snow-covered terrain.
Methods The proposed model combines cloud opacity estimation with an advanced image generation framework. It begins by formalizing assumptions about cloud opacity and brightness to address stratified cloud phenomena in remote sensing images. Its core is an improved Transformer-based generator that introduces a redesigned multi-head mask attention (MMA) mechanism. This mechanism uses a dynamically generated weighted mask, created from estimated cloud opacity and brightness compensation maps, to modulate neuron activation. The mask strategically suppresses features from heavily cloud-obscured pixels during aggregation, focusing the model's attention first on clearer regions and cloud edges. In addition, a progressive sliding-window mask update strategy gradually shrinks the inhibitory mask as the network deepens, allowing the model to iteratively propagate reliable information from outer regions into thick cloud cores and enabling full restoration. The architecture adopts a dual-generator design: the first generator estimates cloud opacity and brightness compensation values to generate the guiding masks, and the second generator, integrated with MMA blocks, synthesizes the final cloud-free image. Training is guided by a composite loss function that combines a non-saturating adversarial loss, a structural similarity (SSIM) loss, and a perceptual loss, ensuring visual fidelity, structural accuracy, and feature consistency. For evaluation, a dedicated alpine dataset is built from Sentinel-2 Level-2A imagery of the Southeastern Tibetan Plateau. Cloud masks from S2cloudless are used to create precise cloudy-clear pairs, and a simulation protocol generates physically realistic training samples with corresponding opacity and compensation ground truth.
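The weighted-mask attention described above can be sketched as follows. This is a minimal single-head NumPy illustration, not the authors' exact formulation: the random projection matrices, the specific blending of the opacity and brightness-compensation maps into a per-pixel weight, and the additive log-weight modulation of the attention logits are all illustrative assumptions.

```python
import numpy as np

def weighted_mask_attention(x, opacity, brightness_comp, d=8, eps=1e-6):
    """Single-head self-attention over flattened pixels that down-weights
    keys from heavily cloud-obscured pixels (illustrative sketch)."""
    n, c = x.shape                                   # n pixels, c channels
    rng = np.random.default_rng(0)                   # fixed projections for the demo
    Wq, Wk, Wv = (rng.standard_normal((c, d)) * 0.1 for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv

    # Weighted mask: clear pixels (low opacity) keep a weight near 1, while
    # the brightness-compensation map softens suppression under thin clouds.
    w = np.clip(1.0 - opacity + 0.5 * brightness_comp, eps, 1.0)   # (n,)

    scores = q @ k.T / np.sqrt(d)                    # (n, n) attention logits
    scores = scores + np.log(w)[None, :]             # suppress obscured keys
    a = np.exp(scores - scores.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)                # row-wise softmax
    return a @ v, a                                  # aggregated features, weights
```

A progressive update in this spirit would relax `w` (i.e. shrink the inhibitory mask) at each successive block, so that context restored at cloud edges gradually propagates toward thick-cloud cores.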
Results The proposed model is rigorously evaluated against three state-of-the-art methods. On quantitative metrics it outperforms all counterparts, achieving a mean absolute error (MAE) of 0.0256, a root mean square error (RMSE) of 0.0356, a peak signal-to-noise ratio (PSNR) of 30.1851 dB, and an SSIM of 0.8996. Visually, it excels in texture preservation and artifact reduction, especially under moderate (20%–30%) and heavy (>30%) cloud cover, where competing methods show significant distortion. The proposed model also remains robust in snow-covered alpine scenes.
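For reference, the four reported metrics can be computed as in the following generic NumPy sketch for images scaled to [0, 1]. The data range, the SSIM constants, and the global single-window SSIM (the standard metric averages this over local windows) are common defaults, not necessarily the authors' exact evaluation code.

```python
import numpy as np

def mae(x, y):
    """Mean absolute error between prediction x and reference y."""
    return np.abs(x - y).mean()

def rmse(x, y):
    """Root mean square error."""
    return np.sqrt(((x - y) ** 2).mean())

def psnr(x, y, data_range=1.0):
    """Peak signal-to-noise ratio in dB for a given dynamic range."""
    return 10.0 * np.log10(data_range ** 2 / ((x - y) ** 2).mean())

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Global (single-window) SSIM with the usual stabilizing constants."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```

Lower MAE/RMSE and higher PSNR/SSIM indicate a restored image closer to the cloud-free reference.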
Conclusions This work presents an effective cloud removal model for alpine regions by unifying thin-cloud correction and thick-cloud inpainting within a Transformer framework. The weighted mask and progressive update strategy provide a targeted, explainable restoration process. This approach significantly enhances the usability of optical imagery in cloudy mountainous areas and offers valuable support for related scientific research.