Citation: ZHANG Hongwei, DU Qingyun, CHEN Zhangjian, ZHANG Chen. A Chinese Address Parsing Method Using RoBERTa-BiLSTM-CRF[J]. Geomatics and Information Science of Wuhan University, 2022, 47(5): 665-672. DOI: 10.13203/j.whugis20210112

A Chinese Address Parsing Method Using RoBERTa-BiLSTM-CRF

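  • Abstract: Current address matching methods rely heavily on word segmentation dictionaries and cannot effectively identify the address elements in an address or the types they belong to. To address these problems, a Chinese address parsing method using deep learning is proposed. Parsed addresses can then be standardized and analyzed by composition to improve address matching results. A comparative evaluation of different word vector representations of addresses and different sequence labeling models shows that the bidirectional gated recurrent unit and bidirectional long short-term memory networks differ little for Chinese address parsing, and that a sparse attention mechanism helps improve the F1 score of address parsing. The proposed method achieves an F1 score of 0.940 on the generalization ability test set and 0.968 on the ordinary test set.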

     

    Abstract:
      Objectives  Current address matching methods rely heavily on word segmentation dictionaries and cannot effectively recognize the address elements in an address or their types. To address these problems, a Chinese address parsing method based on deep learning is proposed.
      Methods  A model combining the robustly optimized bidirectional encoder representations from transformers (BERT) pretraining approach (RoBERTa), bidirectional long short-term memory (BiLSTM), and a conditional random field (CRF) is used to parse Chinese addresses. First, the RoBERTa model produces the word vector representation of the address. Second, BiLSTM learns the address model features and contextual information. Finally, the CRF models the constraint relations between labels.
      Results  Different word vector representations of addresses and different sequence labeling models are compared and evaluated. The proposed method achieves the highest F1-score, 0.940, on the generalization ability test dataset, and its precision, recall, and F1-score on the corresponding test dataset reach 0.962, 0.974, and 0.968, respectively.
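The three test-set figures can be checked against each other. The short sketch below assumes the usual definition of the F1-score as the harmonic mean of precision and recall; the abstract does not state the definition, so this is a consistency check under that assumption rather than the authors' evaluation code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition, assumed here)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the test dataset in the abstract.
precision, recall = 0.962, 0.974
f1 = f1_score(precision, recall)
print(round(f1, 3))  # prints 0.968
```

Under that assumption, the rounded result agrees with the reported F1-score of 0.968.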
      Conclusions  The proposed method needs neither extraction of address model features nor a word segmentation dictionary for address segmentation. Address elements are recognized by learning address context information and address model features. The generalization ability test dataset used in this study can effectively test whether the model is overfitted. The bidirectional gated recurrent unit (BiGRU) and BiLSTM differ little for Chinese address parsing, and the sparse attention mechanism helps improve the F1-score of address parsing.
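The Methods describe a three-stage pipeline in which the CRF layer models constraint relations between labels. The sketch below illustrates only that last stage, in plain Python: Viterbi decoding over per-character scores, with label transitions restricted by a BIO-style rule. Everything concrete here is an assumption for illustration: the tag set `LABELS`, the example address, the hand-written `EMISSIONS` (which stand in for BiLSTM outputs), and the hard minus-infinity penalty. A trained CRF learns real-valued transition scores instead, and the abstract does not give the paper's label scheme.

```python
import math

# Hypothetical tag set; the paper's actual label scheme is not given in the abstract.
LABELS = ["B-city", "I-city", "B-road", "I-road"]


def transition(prev: str, curr: str) -> float:
    """Hard BIO constraint: I-X may only follow B-X or I-X (assumed simplification)."""
    if curr.startswith("I-") and prev[2:] != curr[2:]:
        return -math.inf
    return 0.0


def viterbi(emissions):
    """Return the highest-scoring label sequence that respects the constraints."""
    n_labels = len(LABELS)
    # A sequence may not open with an I- tag.
    score = [e if LABELS[j].startswith("B-") else -math.inf
             for j, e in enumerate(emissions[0])]
    backpointers = []
    for row in emissions[1:]:
        new_score, pointers = [], []
        for j, curr in enumerate(LABELS):
            best_i = max(range(n_labels),
                         key=lambda i: score[i] + transition(LABELS[i], curr))
            new_score.append(score[best_i] + transition(LABELS[best_i], curr) + row[j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final label.
    path = [max(range(n_labels), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return [LABELS[j] for j in path]


def to_elements(chars, labels):
    """Group characters into (element type, text) pairs from BIO labels."""
    elements = []
    for ch, label in zip(chars, labels):
        prefix, kind = label.split("-", 1)
        if prefix == "B" or not elements or elements[-1][0] != kind:
            elements.append([kind, ch])
        else:
            elements[-1][1] += ch
    return [(kind, text) for kind, text in elements]


# Hypothetical address and made-up per-character scores (one row per character,
# one column per entry of LABELS). The fourth row slightly prefers I-road,
# which is invalid directly after I-city.
CHARS = "武汉市珞喻路"
EMISSIONS = [
    [2.0, 0.5, 0.1, 0.1],
    [0.1, 2.0, 0.1, 0.3],
    [0.1, 2.0, 0.2, 0.1],
    [0.2, 0.3, 1.0, 1.2],
    [0.1, 0.1, 0.2, 2.0],
    [0.1, 0.2, 0.1, 2.0],
]

greedy = [LABELS[max(range(len(LABELS)), key=row.__getitem__)] for row in EMISSIONS]
decoded = viterbi(EMISSIONS)
print(greedy)
print(decoded)
print(to_elements(CHARS, decoded))
```

With these made-up scores, per-character argmax labels the fourth character `I-road` immediately after `I-city`, whereas the constrained decoder starts a new element there with `B-road` and yields the two elements 武汉市 (city) and 珞喻路 (road).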

     
