Objectives Current address matching methods rely heavily on word segmentation dictionaries and cannot effectively recognize the address elements in an address or their types. To address these problems, a Chinese address parsing method based on deep learning is proposed.
Methods A model combining a robustly optimized bidirectional encoder representations from transformers (BERT) pretraining approach (RoBERTa), bidirectional long short-term memory (BiLSTM), and a conditional random field (CRF) is used to parse Chinese addresses. First, the RoBERTa model is used to obtain the word vector representation of the address. Then, BiLSTM is used to learn address model features and contextual information. Finally, CRF is used to model the constraint relations between labels.
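As a minimal illustration of the final CRF step, the sketch below runs a Viterbi decode over per-token emission scores (as a BiLSTM output head might produce) plus transition scores that encode label constraints. All label names, scores, and the example address are hypothetical, hand-set values for demonstration; this is not the paper's implementation.

```python
def viterbi_decode(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions: list of {label: score} dicts, one per token.
    transitions: {(prev_label, cur_label): score}; missing pairs are
    treated as forbidden via a large negative default.
    """
    # Initialize with the first token's emission scores.
    best = {lab: (emissions[0][lab], [lab]) for lab in labels}
    for emis in emissions[1:]:
        new_best = {}
        for cur in labels:
            # Choose the previous label maximizing path score + transition.
            prev, (score, path) = max(
                best.items(),
                key=lambda kv: kv[1][0] + transitions.get((kv[0], cur), -1e9),
            )
            total = score + transitions.get((prev, cur), -1e9) + emis[cur]
            new_best[cur] = (total, path + [cur])
        best = new_best
    return max(best.values())[1]

# Hypothetical BIO labels for province/city address elements.
labels = ["B-PROV", "I-PROV", "B-CITY", "I-CITY"]
# Hand-set transitions: e.g., I-CITY may only follow B-CITY.
transitions = {
    ("B-PROV", "I-PROV"): 2.0, ("I-PROV", "B-CITY"): 1.0,
    ("B-PROV", "B-CITY"): 1.0, ("B-CITY", "I-CITY"): 2.0,
}
# Emission scores for a four-token address; token 2 is ambiguous
# (its emission slightly prefers I-CITY over I-PROV).
emissions = [
    {"B-PROV": 3.0, "I-PROV": 0.1, "B-CITY": 0.2, "I-CITY": 0.1},
    {"B-PROV": 0.1, "I-PROV": 2.5, "B-CITY": 0.3, "I-CITY": 2.6},
    {"B-PROV": 0.1, "I-PROV": 0.2, "B-CITY": 3.0, "I-CITY": 0.2},
    {"B-PROV": 0.1, "I-PROV": 0.1, "B-CITY": 0.2, "I-CITY": 2.8},
]
print(viterbi_decode(emissions, transitions, labels))
# → ['B-PROV', 'I-PROV', 'B-CITY', 'I-CITY']
```

Note that the ambiguous second token is decoded as I-PROV despite its emission favoring I-CITY, because the transition scores forbid I-CITY directly after B-PROV; this is the kind of label constraint the CRF layer contributes on top of the BiLSTM scores.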
Results Comparative evaluation against different word vector representations of addresses and different sequence labeling models shows that the proposed method achieves the highest score, 0.940, on the generalization-ability test dataset, and its precision, recall, and F1-score on the corresponding test dataset reach 0.962, 0.974, and 0.968, respectively.
Conclusions The proposed method neither requires hand-crafted address model features nor relies on a word segmentation dictionary to segment addresses; address elements are recognized by learning address context information and address model features. The generalization-ability test dataset used in this study can effectively test whether the model is overfitted. For Chinese address parsing, there is little difference between the bidirectional gated recurrent unit (BiGRU) and BiLSTM, and a sparse attention mechanism helps to improve parsing accuracy.