A Toponym Similarity Measure Method Based on Positional Differences of Overlapping Characters (考虑重合字符位次差异的地名相似性度量方法)

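  • Abstract: Similarity measures commonly used for Chinese toponym matching either count overlapping characters without regard to their positions, or reflect position without considering overlapping characters. By taking into account the correspondence between the overlapping characters of two Chinese toponym strings and the gaps between their positions, this paper constructs a new distance measure and a new similarity measure that combine both factors to compute the offset and the offset similarity of two toponyms. For the case where overlapping characters recur, a minimum-offset principle is established and an entire sequential matching scheme is designed; for the case where a character fragment is shifted, the distance measure is adjusted so that it agrees better with the intuitive perception of toponym similarity. The distance measure satisfies positive definiteness and symmetry but not the triangle inequality. Comparison tests against the Jaccard coefficient and edit-distance similarity show that the proposed offset algorithm characterizes similarity at a finer granularity: it detects positional differences among overlapping characters but gives more weight to differences in non-overlapping characters. In the toponym matching experiment, its matching accuracy and running time are 63.64% and 2 940.56 s, respectively, both better than those of the Jaccard coefficient and edit-distance similarity.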

     

    Abstract:
    Objectives: Because of toponym aliases or negligence in data production, the same geographic entity may carry different names on maps of different scales, and those names may contain spelling errors or deviate from the standard toponym. This hampers multi-scale map visualization and geospatial entity extraction, so a toponym similarity measure is needed to support toponym matching. However, the similarity measures commonly used for Chinese toponym matching either count overlapping characters without considering their positions, or reflect position without considering overlapping characters. Our objective is therefore to construct a new similarity measure that reflects both features simultaneously.
    Methods: By calculating the positional differences between the overlapping characters of two Chinese toponyms, we define a total matching offset that represents the degree of positional difference of the overlapping character set. Because non-overlapping characters should affect the similarity of two toponyms more strongly than overlapping characters do, we also define a total non-matching offset. From these two quantities we define the total offset and the offset similarity. For the complex case in which overlapping characters are repeated, we establish a minimum-offset principle and design an entire sequential matching scheme. For the complex case in which a character fragment is shifted, the sum of the offsets of the individual characters in the fragment is replaced by the overall offset of the fragment, which yields a more reasonable offset value. The total offset satisfies positive definiteness and symmetry but not the triangle inequality, so it is not a metric in the strict sense, and toponym similarity is more appropriately expressed by the offset similarity.
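    The abstract does not give the formulas, so the following Python sketch only illustrates the general idea under assumptions of our own, and is not the authors' definition. Assumed here: matched characters contribute the absolute difference of their positions; repeated characters are matched in order, choosing the in-order pairing with the smallest summed difference; every unmatched character costs the length of the longer string, which exceeds any possible positional difference; and similarity is one minus the total offset divided by its worst-case value. The fragment-offset adjustment is not modelled, and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def _positions(s):
    """Map each character to the sorted list of indices where it occurs."""
    pos = defaultdict(list)
    for i, ch in enumerate(s):
        pos[ch].append(i)
    return pos


def _min_sequential_offset(short, long_):
    """Pair every index in `short` with a distinct index in `long_`, in order,
    and return the smallest achievable sum of |i - j| (brute force, which is
    acceptable for short strings such as toponyms)."""
    return min(
        sum(abs(i - j) for i, j in zip(short, subset))
        for subset in combinations(long_, len(short))
    )


def total_offset(a, b):
    """Assumed total offset: matching offset plus non-matching offset."""
    pa, pb = _positions(a), _positions(b)
    penalty = max(len(a), len(b))  # assumption: cost of one unmatched character
    matched_offset = 0
    matched_pairs = 0
    for ch in pa.keys() & pb.keys():
        short, long_ = sorted((pa[ch], pb[ch]), key=len)
        matched_offset += _min_sequential_offset(short, long_)
        matched_pairs += len(short)
    unmatched = len(a) + len(b) - 2 * matched_pairs
    return matched_offset + penalty * unmatched


def offset_similarity(a, b):
    """Assumed normalization: 1 minus total offset over its worst case."""
    if not a and not b:
        return 1.0
    worst = (len(a) + len(b)) * max(len(a), len(b))  # no character matched
    return 1.0 - total_offset(a, b) / worst
```

    Under these assumptions, swapping two characters (for example "北京市" against "京北市") lowers the similarity less than dropping a character ("北京市" against "北京"), which is the qualitative behaviour the abstract describes.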
    Results: Comparison with Jaccard similarity and Levenshtein similarity shows that the offset similarity characterizes toponym similarity at a finer granularity. It gives greater weight to character differences, so the similarity drops markedly as the proportion of overlapping characters decreases. It gives less weight to purely positional differences, but these still appear as slight differences in the similarity value. In the toponym matching experiment, the matching accuracy and running time are 63.64% and 2 940.56 s, respectively, both better than those of Jaccard similarity and Levenshtein similarity.
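    For reference, the two baselines can be sketched in Python as follows. The abstract does not state how the baselines were normalized, so this assumes the common forms: Jaccard similarity over the sets of characters, and Levenshtein similarity as one minus the edit distance divided by the length of the longer string. The function names are illustrative only.

```python
def jaccard_similarity(a, b):
    """Size of the shared character set over the size of the combined set."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def levenshtein_distance(a, b):
    """Classic dynamic-programming edit distance, kept to two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(a, b):
    """Assumed normalization: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest
```

    The pair "ab" and "ba" shows the gap that motivates the paper: the set-based Jaccard similarity ignores position entirely, while the edit-distance similarity treats the swap as two substitutions and does not credit the shared characters.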
    Conclusions: The offset similarity has significant advantages in the toponym matching scenario. Like other string similarity measures, however, it struggles to capture semantics. Further work could optimize the algorithms that handle the complex cases, test applicability to other languages, and take the semantic structure of toponyms into account.

     
