Chinese Word Segmentation of City Address Sets Based on a Statistical Decision Tree

Abstract: Unlike conventional address segmentation models, which rely on a city address dictionary or a rule base, this paper proposes a segmentation method that does not depend on an address dictionary and is instead based on mining massive address data. The method uses statistical regularities to compute the distribution characteristics of address elements in the address dataset, and mines the suffix points and drop points that mark segmentation positions in the address data. It then constructs a statistical decision tree from the relative positions of the suffix points and drop points to extract the address elements. The method is validated on building address census data from Shenzhen, and the extracted elements form a useful supplement to current address gazetteers.
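
The drop-point idea can be illustrated with a minimal Python sketch. The abstract does not define drop points precisely, so the definition used here is an assumption: a drop point is a position where the number of addresses sharing a prefix falls sharply once the prefix is extended by one character. The function names, the `ratio` threshold, and the four toy addresses are hypothetical and are not taken from the Shenzhen census data. Suffix points and the decision tree itself are not modeled.

```python
from collections import Counter


def prefix_counts(addresses):
    """Count how many addresses start with each possible prefix."""
    counts = Counter()
    for addr in addresses:
        for end in range(1, len(addr) + 1):
            counts[addr[:end]] += 1
    return counts


def drop_points(addr, counts, ratio=0.8):
    """Return cut positions where prefix support falls sharply.

    Assumed definition (not from the paper): a cut at position i means
    that extending addr[:i] by one character keeps at most `ratio` of
    the addresses that share addr[:i].
    """
    cuts = []
    for i in range(1, len(addr)):
        if counts[addr[:i + 1]] <= ratio * counts[addr[:i]]:
            cuts.append(i)
    return cuts


def segment(addr, counts, ratio=0.8):
    """Split an address at its drop points."""
    bounds = [0] + drop_points(addr, counts, ratio) + [len(addr)]
    return [addr[a:b] for a, b in zip(bounds, bounds[1:])]


# Toy data for illustration only.
toy_addresses = [
    "深圳市南山区科技路1号",
    "深圳市南山区科技路2号",
    "深圳市南山区学府路3号",
    "深圳市福田区深南路4号",
]
toy_counts = prefix_counts(toy_addresses)
print(segment(toy_addresses[0], toy_counts))
```

On this toy set the first address splits after the city, district, and road elements, because each of those positions is followed by a character that fewer addresses share. With so few addresses the later ones split less cleanly, which suggests why a purely frequency-based cue would need a large dataset and a second cue such as suffix points.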