Geometry-Constrained Attention and Point-Voxel Convolutions for 3D Occupancy Prediction in Intelligent Unmanned Systems

  • Abstract: Environment perception, serving as the “eyes” of Intelligent Unmanned Systems (IUS), plays a pivotal role in IUS. 3D Occupancy Prediction (3D-OP) models the real world as a fine-grained semantic voxel structure and is a core environment perception task for current IUS. However, existing 3D-OP approaches largely inherit techniques from 3D object detection and therefore suffer from several problems in 3D feature processing, such as feature sparsity and semantic errors. To address these problems, this paper proposes an end-to-end surface-feature propagation network for 3D occupancy prediction. The network introduces a geometry-constrained attention mechanism and point-voxel convolutions to build a 3D feature construction module and a 3D feature propagation module, respectively; in addition, a mask-attention-based category refinement module is added to account for the distribution of each category in the environment. The method is designed and validated mainly for ground and water-surface scenarios such as vehicle-mounted and ship-mounted platforms. It achieves excellent qualitative and quantitative results on the public outdoor driving datasets SemanticKITTI and Occ3D-nuScenes as well as on a self-built sea-surface dataset, demonstrating strong prediction accuracy and scene adaptability, verifying its practicality and generalization potential in IUS, and providing an effective solution for IUS environment perception.

     

    Abstract: Objectives: Environment perception serves as the “eye” of Intelligent Unmanned Systems (IUS), playing a pivotal role in safe and reliable navigation for ground and sea-surface autonomous platforms. Recently, vision-centered 3D occupancy prediction (3D-OP) has become a central perception task, as it simultaneously generates dense voxel-level estimates of occupancy and semantic labels for both foreground and background. However, many existing methods adopt architectural designs from 3D object detection and consequently suffer from sparse 3D features and semantic misalignment: the features they extract are often insufficiently dense, positionally imprecise, and poorly aligned with the true geometry and semantics of the scene. The objective of this study is therefore to propose a vision-based 3D occupancy prediction model that overcomes these limitations and produces accurate, dense, and semantically consistent occupancy predictions. We also aim to validate the model’s effectiveness and deployability in ground and sea-surface scenarios. Methods: To meet these objectives, we propose the Surface-feature Propagation 3D Occupancy Prediction Network (SPOcc), which integrates three specialized modules designed to address the identified limitations. First, a 3D feature construction module (FC3D) driven by geometry-constrained attention (GCA) uses depth guidance to locate likely surface positions within observed regions. Features are constructed selectively at these surface candidates, producing positionally accurate and information-rich 3D representations while suppressing spurious responses in empty space. 
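The paper does not give the exact FC3D/GCA formulation here, but the underlying idea of depth-guided feature construction can be illustrated with a minimal sketch: 2D image features are placed only at the voxels where predicted depth says a surface lies, rather than densely along every camera ray. The function name, tensor shapes, and nearest-voxel scatter below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def construct_surface_features(feat_2d, depth, K, voxel_size, grid_shape):
    """Scatter 2D image features into a 3D grid only at depth-predicted
    surface voxels, leaving empty space untouched (hypothetical sketch).
    feat_2d: (C, H, W) image features; depth: (H, W); K: (3, 3) intrinsics.
    """
    C, H, W = feat_2d.shape
    vol = np.zeros((C,) + tuple(grid_shape), dtype=feat_2d.dtype)
    us, vs = np.meshgrid(np.arange(W), np.arange(H))   # pixel coordinates
    # Back-project every pixel to camera space at its predicted depth.
    x = (us - K[0, 2]) / K[0, 0] * depth
    y = (vs - K[1, 2]) / K[1, 1] * depth
    pts = np.stack([x, y, depth], axis=-1)             # (H, W, 3) surface points
    idx = np.floor(pts / voxel_size).astype(int)       # nearest-voxel indices
    ok = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=-1)
    ix, iy, iz = idx[ok, 0], idx[ok, 1], idx[ok, 2]
    # Write features only at observed surface candidates (last write wins
    # on collisions; a real model would pool or attend instead).
    vol[:, ix, iy, iz] = feat_2d[:, vs[ok], us[ok]]
    return vol
```

The resulting volume is sparse by construction: voxels off the predicted surfaces stay zero, which is the intended suppression of spurious responses in empty space.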
Second, a 3D feature propagation module (FP3D), formulated in a point-voxel convolution paradigm, treats surface voxels as pseudo points to capture global context through point-based aggregation, and then applies voxel-level convolutions with multiple receptive fields to diffuse features from visible areas into occluded regions. This design enforces spatial coherence and enhances inference robustness under heavy occlusion, a key challenge in 3D-OP tasks. Third, a mask-based transformer category refinement module (MTCR) predicts a binary mask for each semantic class and applies masked self-attention confined within each mask to refine per-class occupancy estimates. By constraining attention to same-class regions, the MTCR module reduces inter-class interference and strengthens discrimination between closely adjacent or visually similar categories. The entire pipeline is trained end to end with semantic and geometric supervision. To investigate the trade-off between the effectiveness and efficiency of SPOcc, we conduct experiments in both ground and sea-surface scenarios, evaluating accuracy, computational cost, and practical deployability. Results: Extensive experiments were conducted on the publicly available outdoor driving datasets SemanticKITTI and Occ3D-nuScenes, as well as on a self-collected sea-surface dataset. Across diverse scene types, SPOcc outperforms state-of-the-art baselines in overall intersection over union and in per-class accuracy. Quantitative evaluations demonstrate notable improvements in scene-level consistency and in the detection of small or partially occluded objects. Qualitative visualizations reveal crisper object boundaries, reduced semantic leakage into background regions, and more reliable recovery of occluded structures compared with competing methods. 
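The two-branch point-voxel idea behind FP3D can be sketched in a few lines: a point branch aggregates a global descriptor over the occupied surface voxels, and a voxel branch smooths the volume with filters of several receptive fields so features spread into unobserved space. The box filter, wrap-around borders, and kernel sizes below are toy assumptions standing in for learned 3D convolutions.

```python
import numpy as np

def box_filter3d(vol, k):
    """Average over a k*k*k neighborhood via shifted copies (wrap-around
    borders from np.roll; a toy stand-in for a learned 3D convolution)."""
    r = k // 2
    acc = np.zeros_like(vol)
    for dx in range(-r, r + 1):
        for dy in range(-r, r + 1):
            for dz in range(-r, r + 1):
                acc += np.roll(vol, (dx, dy, dz), axis=(1, 2, 3))
    return acc / k ** 3

def propagate_point_voxel(vol):
    """Hypothetical sketch of the FP3D idea.
    Point branch: aggregate a global descriptor over surface voxels
    (treated as pseudo points). Voxel branch: smooth with several
    receptive fields so features diffuse into unobserved voxels.
    vol: (C, X, Y, Z) sparse surface-feature volume."""
    occ = np.any(vol != 0, axis=0)                     # surface voxels
    if occ.any():
        global_ctx = vol[:, occ].mean(axis=1)          # (C,) global context
        vol = vol + occ * global_ctx[:, None, None, None]
    out = vol.copy()
    for k in (3, 5):                                   # multiple receptive fields
        out = out + box_filter3d(vol, k)
    return out
```

Each pass moves mass from visible surface voxels into their neighborhoods, which is the diffusion-into-occlusion behavior the module is meant to provide.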
The proposed approach achieves these accuracy improvements while incurring a modest increase in computational and memory cost. In typical deployment settings for vehicles and vessels, the accuracy gains substantially outweigh the additional resource requirements. Ablation studies confirm that each module contributes meaningfully to final performance: geometry-constrained feature construction improves positional fidelity, point-voxel propagation enhances occlusion recovery and spatial coherence, and mask-based transformer refinement increases per-class discrimination. Conclusions: The proposed SPOcc framework effectively addresses the challenges of feature sparsity, semantic misalignment, and occlusion inference in vision-based 3D occupancy prediction. By unifying three complementary modules for geometry-aware feature construction, point-voxel propagation, and mask-constrained transformer refinement, our work yields more accurate and reliable dense occupancy and semantic estimates across ground and sea-surface scenes. These improvements suggest strong practical value for real-world IUS. Future work will explore further computational optimizations, integration of complementary low-cost multimodal cues for increased robustness under adverse environmental conditions, and broader evaluation across additional platforms and viewpoints to assess and extend the method’s generalization.
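The mask-constrained refinement described in the Methods can likewise be illustrated with a minimal sketch: self-attention is computed separately inside each predicted class mask, so voxels of one class never attend to another. The function name, flat voxel-list layout, and residual update are illustrative assumptions (the paper's MTCR operates on full 3D volumes with learned projections); class masks are assumed disjoint here.

```python
import numpy as np

def masked_class_attention(feats, class_masks):
    """Hypothetical sketch of mask-constrained refinement: scaled
    dot-product self-attention restricted to voxels that share a
    predicted class mask, applied as a residual update.
    feats: (N, C) voxel features; class_masks: (K, N) boolean, disjoint."""
    N, C = feats.shape
    out = feats.copy()
    for mask in class_masks:
        idx = np.where(mask)[0]
        if idx.size == 0:
            continue
        sub = feats[idx]                        # same-class voxels only
        logits = sub @ sub.T / np.sqrt(C)       # scaled dot-product scores
        logits -= logits.max(axis=1, keepdims=True)
        attn = np.exp(logits)
        attn /= attn.sum(axis=1, keepdims=True) # row-wise softmax
        out[idx] = sub + attn @ sub             # residual refinement
    return out
```

Because attention weights are confined to each mask, features of visually similar but differently labeled neighbors cannot leak into one another, which is the inter-class interference the module suppresses.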
