Character Life-Track Information Model and Information Extraction Method

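  • Summary: In current geographic information system applications, the spatiotemporal interpretation of character information is very important: it helps geographic researchers generate many types of thematic maps and express the related geographic content. Based on an analysis of the characteristics of existing character data models, and taking into account both geographic application requirements and the current state of information extraction technology, a life-track information model that highlights the spatiotemporal features of characters is proposed. Using online encyclopedia data as an example, the extraction of each element of the model is implemented, effectively addressing two key problems: event description identification and location information extraction. Testing and analysis show that the event description extraction method is highly practical, while the location information extraction method also achieves reasonable results with a limited annotated corpus and yields useful experimental conclusions.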

     

    Abstract:
      Objectives  In the field of human-related geographic information systems (GIS), the spatiotemporal analysis of character information has received increasing attention. It is important because it helps GIS users generate various thematic maps and visualize human geographic content. To keep pace with the trend toward intelligent GIS, it is of great significance to combine GIS requirements with natural language processing (NLP) methods and build a character information model.
      Methods  Firstly, we review the research status of character information models in GIS and NLP and put forward the concept of the life-track, which is mainly composed of a series of character event mentions. Secondly, considering the feasibility of existing information extraction methods, a conceptual character life-track information model is determined. This model focuses on event information to highlight character spatiotemporal elements and also includes character attribute and relationship information. Finally, a complete information extraction process is designed for the model, with online character encyclopedia pages as the data source. This paper focuses on two sub-tasks in the process: one is to use time features and OpenHowNet semantic calculations to identify event mentions, and the other is to use linguistic features and the conditional random field (CRF) model to extract location information.
      Results  Experimental results show that the event mention identification method achieves an accuracy of 91.8%. Although the average F1 value of location information extraction is only 78% with a limited labeled corpus, analyzing the weights of the transition matrix of the CRF model yields some valuable conclusions: (1) The location phrase and its adjacent words have an obvious effect as features. (2) Dependency syntactic parsing and the relative position of the word in the sentence have little influence on the extraction. (3) The target of location information extraction is the place where the event occurred, but in a few cases location phrases are unrelated to the location of the event. This is the main reason for the low accuracy.
      Conclusions  Combining GIS with NLP is a promising direction for intelligent GIS development. The character life-track information model provides an example of the large-scale use of ubiquitous internet information. Improving the methods applied in the extraction process and applying them to more types of online text are the focus of our team's subsequent research.
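The two sub-tasks named in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the year pattern stands in for the unspecified "time features", the OpenHowNet semantic calculation is omitted entirely, and the function names `candidate_event_mentions` and `token_features` are hypothetical. The second function only builds the kind of neighboring-word feature dictionary that CRF toolkits commonly accept, reflecting the finding that a location phrase and its adjacent words are informative.

```python
import re

# Assumption: a four-digit year (1000-2099) not embedded in a longer number
# serves as the time feature. The paper does not specify its time features.
YEAR = re.compile(r"(?<!\d)(?:1\d{3}|20\d{2})(?!\d)")


def candidate_event_mentions(sentences):
    """Keep sentences that contain a year-like time expression."""
    return [s for s in sentences if YEAR.search(s)]


def token_features(tokens, i, window=1):
    """Build a feature dict for token i from the word and its neighbors."""
    feats = {"word": tokens[i]}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        # Positions outside the sentence get a padding symbol.
        feats[f"word[{offset:+d}]"] = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
    return feats


if __name__ == "__main__":
    sentences = [
        "He was born in Hunan in 1893.",
        "He enjoyed poetry.",
        "1921年他前往上海。",
    ]
    print(candidate_event_mentions(sentences))
    print(token_features(["born", "in", "Hunan"], 2))
```

In a fuller pipeline, the candidate sentences would still need a semantic filter (the role OpenHowNet plays in the paper), and the feature dictionaries would be passed, one list per sentence, to a CRF trainer together with location labels.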

     
