Abstract:
With the rapid development of the World Wide Web, a huge quantity of geographic information is hidden in unstructured text. Toponym recognition is the foundation for mining this latent geographic information. Traditional toponym recognition methods based on natural language processing ignore the structure of Chinese toponyms and users' customary usage, which results in low recall and precision. In this paper, linguistic knowledge is introduced to analyze Chinese toponyms, and more specific morpheme categories are identified. Toponym recognition is then transformed into an equivalent sequence labeling problem based on the conditional random field (CRF). A labeling schema suited to Chinese toponyms is also designed to improve recognition accuracy. In the experiments, the 1.7-million tagged corpus of The People's Daily is used to test the proposed method. The recall, precision and F value of the result are 92.69%, 96.73% and 94.67%, respectively, which are better than those of the other machine learning models compared. The results show that the proposed method is effective for recognizing Chinese toponyms. This research can provide more precise toponym services for geographic information applications.
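To make the sequence labeling formulation concrete, the following minimal sketch shows how toponym spans in a sentence could be turned into one label per character and recovered again, and how the F value follows from precision and recall. The tag names `B-LOC`, `I-LOC` and `O`, the helper names, and the example sentence are assumptions for illustration only. The abstract does not specify the labeling schema designed in the paper, which is based on finer morpheme categories, and CRF training itself is not shown here.

```python
from typing import List, Tuple

# A toponym span as [start, end) character offsets (hypothetical representation).
Span = Tuple[int, int]


def encode_labels(text: str, spans: List[Span]) -> List[str]:
    """Assign one label per character: B-LOC, I-LOC, or O (assumed tag set)."""
    labels = ["O"] * len(text)
    for start, end in spans:
        labels[start] = "B-LOC"          # first character of the toponym
        for i in range(start + 1, end):
            labels[i] = "I-LOC"          # remaining characters of the toponym
    return labels


def decode_labels(labels: List[str]) -> List[Span]:
    """Recover toponym spans from a per-character label sequence."""
    spans: List[Span] = []
    start = None
    for i, label in enumerate(labels):
        if label == "B-LOC":
            if start is not None:        # a new toponym starts right after another
                spans.append((start, i))
            start = i
        elif label == "I-LOC" and start is not None:
            continue                     # still inside the current toponym
        else:
            if start is not None:        # the current toponym ends here
                spans.append((start, i))
            start = None
    if start is not None:                # toponym runs to the end of the sentence
        spans.append((start, len(labels)))
    return spans


def f_value(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the balanced F value)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative sentence: "I work in Beijing City"; the toponym covers offsets 2 to 4.
sentence = "我在北京市工作"
gold_spans = [(2, 5)]
labels = encode_labels(sentence, gold_spans)
recovered = decode_labels(labels)

# F value computed from the precision and recall reported in the abstract.
reported_f = round(f_value(0.9673, 0.9269) * 100, 2)
```

In this formulation a sequence model such as a CRF predicts the label sequence, and the decoding step turns its output back into toponym spans. The reported precision of 96.73% and recall of 92.69% give a balanced F value of 94.67%, matching the figure in the abstract.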