HAN Ting, CHEN Siyu, MA Jin, CAI Guorong, ZHANG Wuming, CHEN Yiping. Road Image Free Space Detection via Learnable Deep Position Encoding[J]. Geomatics and Information Science of Wuhan University, 2024, 49(4): 582-594. DOI: 10.13203/j.whugis20230252

Road Image Free Space Detection via Learnable Deep Position Encoding

    Abstract:
    Objectives Free space detection, that is, detection of the drivable road area, is a key foundation of scene perception in driver assistance systems. Methods based on convolutional neural networks (CNNs) struggle to capture global context, so their predictions tend to contain holes and interruptions that break the completeness of the road region. Transformer-based methods, in contrast, lack local understanding, which leads to misaligned boundaries and predictions that spill beyond the road edge.
    Methods To overcome the shortcomings of both families, we propose a pyramid Transformer architecture guided by learnable deep position encoding, which combines CNNs and Transformers for road free space detection. First, a pyramid Transformer backbone extracts road features with a global receptive field. Second, local window attention in dual-Transformer blocks compensates for the loss of detail, and a shrunken self-attention improves the efficiency of feature computation. Finally, because conventional non-learnable position encoding ignores the spatial correlation between pixels and the real scene, a learnable position encoding is built from convolutional features of the depth image, which addresses the attention shift and semantic misalignment caused by this disconnect from the real scene.
    Results The method was tested and evaluated on the KITTI road dataset, the Cityscapes dataset, and a self-built Xiamen road dataset. It achieves maximum F-measures of 97.53% and 98.54% on KITTI and Cityscapes, respectively, exceeding all methods currently on the KITTI road benchmark.
    Conclusions The method maintains high efficiency while delivering high stability and accuracy. It can also provide high-precision semantic prior information for tasks such as path planning and trajectory prediction in driver assistance systems.
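
    The maximum F-measure reported in the results is the F-measure maximized over the threshold applied to the per-pixel road probability. The following is a minimal sketch of that metric in plain Python, not the benchmark's official evaluation code; the function name `max_f_measure`, the number of thresholds, and the toy inputs are assumptions made for illustration, and the official protocol may differ in details.

```python
def max_f_measure(probs, labels, num_thresholds=101):
    """Sweep evenly spaced thresholds in [0, 1] and return (best_f, best_threshold).

    probs  : flat sequence of predicted road probabilities, one per pixel
    labels : flat sequence of ground-truth values (1 = road, 0 = not road)
    Hypothetical helper for illustration only.
    """
    best_f, best_t = 0.0, 0.0
    for k in range(num_thresholds):
        t = k / (num_thresholds - 1)
        tp = fp = fn = 0
        for p, y in zip(probs, labels):
            pred = p >= t  # pixel is classified as road at this threshold
            if pred and y:
                tp += 1
            elif pred and not y:
                fp += 1
            elif not pred and y:
                fn += 1
        # F = 2 * precision * recall / (precision + recall) = 2TP / (2TP + FP + FN)
        denom = 2 * tp + fp + fn
        f = 2 * tp / denom if denom else 0.0
        if f > best_f:
            best_f, best_t = f, t
    return best_f, best_t


if __name__ == "__main__":
    # Toy example: five pixels, three of which are road.
    probs = [0.9, 0.8, 0.4, 0.3, 0.1]
    labels = [1, 1, 1, 0, 0]
    f, t = max_f_measure(probs, labels)
    print(f"max F = {f:.4f} at threshold {t:.2f}")
```

    Sweeping the threshold makes the score independent of any single operating point, which is why a maximum F-measure is convenient for comparing segmentation methods whose probability outputs are calibrated differently.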

     
