Abstract:
Objectives Because of toponym aliases or negligence in data production, the same geographic entity may carry different names in maps of different scales, and these names may contain spelling errors or fail to match the standard toponym. This inconveniences multi-scale map visualization and geospatial entity extraction, so a toponym similarity measure is needed to support toponym matching. However, the similarity measures commonly used for Chinese toponym matching either consider only the number of overlapping characters and ignore character position, or reflect only positional features. Our objective is therefore to construct a new similarity measure that reflects both features simultaneously.
Methods By calculating the positional differences between the overlapping characters of two Chinese toponyms, we define a total matching offset that represents the degree of positional difference of the overlapping character set. Because non-overlapping characters should affect the similarity of two toponyms more strongly than overlapping characters do, we also define a total non-matching offset. From these two quantities we define the total offset and the offset similarity. For the complex case in which overlapping characters are repeated, we adopt a minimum offset principle and design an entire sequential matching scheme. For the complex case in which a character fragment is shifted as a whole, the sum of the offsets of the individual characters in the fragment is replaced by the overall offset of the fragment, which makes the offset value more reasonable. The total offset satisfies positive definiteness and symmetry but not the triangle inequality, so it is more appropriate to express toponym similarity with the offset similarity.
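The idea of combining a matching offset with a non-matching offset can be illustrated with a minimal Python sketch. It is not the definition used in this work: the function names, the per-character `penalty` (assumed here to be the length of the longer toponym, so that an unmatched character always costs more than any positional shift), the normalization in `offset_similarity`, and the brute-force, order-preserving pairing used for repeated characters are all assumptions made for illustration. The fragment-offset case is not covered.

```python
from collections import defaultdict
from itertools import combinations


def _positions(name):
    """Map each character to the sorted list of indices where it occurs."""
    pos = defaultdict(list)
    for i, ch in enumerate(name):
        pos[ch].append(i)
    return pos


def _min_pair_offset(short, long):
    """Smallest summed |i - j| when every index in `short` is paired,
    in order, with a distinct index chosen from `long`."""
    return min(
        sum(abs(i - j) for i, j in zip(short, chosen))
        for chosen in combinations(long, len(short))
    )


def total_offset(a, b, penalty=None):
    """Hypothetical total offset: positional shifts of matched characters
    plus `penalty` for every character occurrence left unmatched."""
    if penalty is None:
        penalty = max(len(a), len(b))  # assumed weight, not from the source
    pos_a, pos_b = _positions(a), _positions(b)
    matching = 0
    unmatched = 0
    for ch in set(pos_a) | set(pos_b):
        pa, pb = pos_a.get(ch, []), pos_b.get(ch, [])
        short, long = (pa, pb) if len(pa) <= len(pb) else (pb, pa)
        if short:
            matching += _min_pair_offset(short, long)
        unmatched += len(long) - len(short)
    return matching + penalty * unmatched


def offset_similarity(a, b):
    """Hypothetical normalization: 1 for identical names, 0 for no overlap."""
    if not a and not b:
        return 1.0
    penalty = max(len(a), len(b))
    worst = penalty * (len(a) + len(b))  # every character unmatched
    return 1 - total_offset(a, b, penalty) / worst
```

Under these assumptions, a pure reordering such as "北京市" versus "市北京" lowers the similarity only modestly, while two names that share no characters score 0.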
Results The results show that, compared with Jaccard similarity and Levenshtein similarity, the offset similarity characterizes toponym similarity more finely. It gives greater weight to character differences, so the similarity decreases significantly as the proportion of overlapping characters decreases. It gives less weight to purely positional differences, which nevertheless appear as slight differences in the similarity values. In the toponym matching experiment, the matching accuracy and running time are 63.64% and 2940.56 s, respectively, both better than those of Jaccard similarity and Levenshtein similarity.
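For reference, the two baseline measures can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the experiment: Jaccard similarity is computed here on character sets, and the Levenshtein distance is normalized by the length of the longer string, which is one common convention and an assumption on our part.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Shared characters divided by all distinct characters; ignores position."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein_distance(a, b) / longest
```

For example, `jaccard_similarity("北京市", "市北京")` is 1.0 because the two names share the same character set, whereas `levenshtein_distance("北京市", "市北京")` is 2, which shows how the former ignores position and the latter is driven by it.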
Conclusions The offset similarity has significant advantages in the toponym matching scenario. However, like other string similarity measures, it cannot readily capture semantics. Further optimization can be explored in three directions: algorithmic solutions for handling complex cases, applicability to other languages, and consideration of the semantic structure of toponyms.