Research on a Method for Constructing a General Content Knowledge System for Provincial Comprehensive Atlases of China
-
Abstract
Objectives: As a systematic scientific work, an atlas must organize its content around a rigorous logical framework and core themes. Content design is a core part of atlas development and lays the foundation for the scientific quality of the atlas as a whole. Because no standardized reference exists for atlas content design, a method is proposed for constructing a general content structure system for provincial comprehensive atlases that integrates Natural Language Processing (NLP) and Knowledge Graphs (KG), with the aim of promoting the automation and intelligence of atlas content design. Methods: A structured database was built by collecting and organizing 98 regional comprehensive atlases, both domestic and international. A general content structure system for provincial atlases was then derived from the subgroups, whose texts were clustered using pre-trained language models (PLMs). Combining the atlas dataset with knowledge graph embedding techniques enhanced the semantic understanding of the PLMs, which in turn improved the classification accuracy for thematic subgroups. Results: (1) A three-tier, hierarchical, standardized content structure system of 'groups – subgroups – maps' was constructed, comprising 4 groups, 55 subgroups and 289 maps. (2) Applying a PLM to the hierarchical clustering of 2,319 atlas subgroup texts showed that the clustering metrics exhibit inflection points within the typical range of the number of subgroups in a thematic group. (3) Integrating atlas knowledge and KG improved the accuracy and robustness of PLMs in the clustering of thematic subgroups, raising clustering accuracy (ACC) by up to 11.46%.
Conclusions: The general content structure system can provide a scientific basis for the content design of provincial atlases, promote the standardization of content and structure, and enhance the completeness and scientific quality of atlases. Integrating knowledge graph embedding technology into the fine-tuning of PLMs offers a new approach to improving PLM performance, specifically the clustering accuracy achieved in text clustering applications.
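The abstract reports results in terms of clustering accuracy (ACC) but does not state how it is computed. The sketch below assumes the common definition: cluster ids are mapped one-to-one onto reference labels so that the number of correctly labelled items is maximized, and ACC is that maximum divided by the number of items. The function name `clustering_accuracy` and the thematic labels in the example are hypothetical and are not taken from the study.

```python
from collections import Counter
from itertools import permutations


def clustering_accuracy(true_labels, cluster_ids):
    """Best-match clustering accuracy (ACC), assuming the usual definition.

    Tries every one-to-one assignment of cluster ids to reference labels
    and returns the highest fraction of items that end up correctly
    labelled. Brute force, so only practical for a handful of clusters.
    """
    if len(true_labels) != len(cluster_ids):
        raise ValueError("label sequences must have the same length")
    if len(true_labels) == 0:
        raise ValueError("label sequences must not be empty")

    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    # Co-occurrence counts: items of each reference label in each cluster.
    overlap = Counter(zip(cluster_ids, true_labels))

    # Pad the shorter side with None so unequal counts are still covered;
    # pairs involving None never occur in the data and contribute zero.
    size = max(len(classes), len(clusters))
    padded_classes = classes + [None] * (size - len(classes))
    padded_clusters = clusters + [None] * (size - len(clusters))

    best = 0
    for assignment in permutations(padded_classes):
        matched = sum(
            overlap[(cluster, label)]
            for cluster, label in zip(padded_clusters, assignment)
        )
        best = max(best, matched)
    return best / len(true_labels)


# Hypothetical subgroup labels and cluster ids, for illustration only.
reference = ["climate", "climate", "hydrology", "hydrology", "soil", "soil"]
predicted = [2, 2, 0, 0, 1, 0]
acc = clustering_accuracy(reference, predicted)  # 5 of 6 items match
```

Exhaustive search grows factorially with the number of clusters, so for dozens of subgroups the same optimal mapping would normally be found with the Hungarian algorithm, for example via `scipy.optimize.linear_sum_assignment` applied to the negated overlap matrix.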