Abstract:
Objectives The content structure of an atlas, as a systematic scientific work, must adhere to a rigorous logical framework and coherent thematic organization. Content design constitutes the foundation of atlas scientific quality. However, the absence of standardized references for atlas content structuring limits consistency and automation. We propose a general content structure system for provincial comprehensive atlases by integrating natural language processing and knowledge graph technologies, aiming to advance intelligent and automated atlas content design.
Methods A structured database was constructed from 98 regional comprehensive atlases collected from domestic and international sources. A general content structure framework for provincial atlases was then derived through subgroup-based analysis. Hierarchical text clustering of atlas subgroup texts was conducted using pre-trained language models (PLMs). To enhance semantic representation, atlas domain knowledge was incorporated through knowledge graph embedding techniques, thereby improving the semantic understanding capacity of the PLMs.
Results A three-tier hierarchical and standardized content structure system, organized as group-subgroup-map, was successfully established, comprising 4 groups, 55 subgroups, and 289 map categories. Hierarchical clustering of 2 319 atlas subgroup texts using PLM revealed distinct inflection points in clustering evaluation metrics, corresponding to the typical range of subgroup numbers within thematic groups. Furthermore, the integration of atlas domain knowledge through knowledge graph embedding significantly enhanced clustering robustness and precision, improving overall clustering accuracy by up to 11.46%.
Conclusions The proposed general content structure system provides a scientific foundation for provincial atlas content design, enhances structural standardization, and improves the completeness and scientific rigor of atlas compilation. Furthermore, the integration of knowledge graph embedding offers a novel framework for enhancing PLM performance in domain-specific text clustering tasks, contributing to improved clustering accuracy and intelligent atlas design methodologies.
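
The clustering workflow summarized in the Methods and Results can be illustrated with a minimal sketch. It is not the authors' implementation: the toy vectors stand in for PLM text embeddings and knowledge graph embeddings, fusing the two by concatenation is an assumption (the abstract does not state how they are combined), and average linkage, cosine distance, and the silhouette score are illustrative choices for the linkage rule and evaluation metric. All function and variable names are hypothetical.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def mean_distance(vectors, i, members):
    """Mean cosine distance from item i to the items listed in members."""
    return sum(cosine_distance(vectors[i], vectors[j]) for j in members) / len(members)

def agglomerative(vectors, k):
    """Average-linkage agglomerative clustering; returns one label per vector."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = sum(mean_distance(vectors, i, clusters[b]) for i in clusters[a])
                d /= len(clusters[a])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters.pop(b)      # b > a, so index a is unaffected
        clusters[a].extend(merged)
    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

def silhouette(vectors, labels):
    """Mean silhouette score; singletons contribute 0. Needs >= 2 clusters."""
    n = len(vectors)
    total = 0.0
    for i in range(n):
        same = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not same:
            continue
        a = mean_distance(vectors, i, same)
        b = min(
            mean_distance(vectors, i, [j for j in range(n) if labels[j] == other])
            for other in set(labels) if other != labels[i]
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n

# Toy stand-ins for PLM embeddings of six subgroup titles (two themes).
text_vecs = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [1.0, 0.0, 0.1],
             [0.0, 1.0, 0.1], [0.1, 0.9, 0.0], [0.0, 1.0, 0.2]]
# Toy stand-ins for knowledge graph embeddings of the same items.
kg_vecs = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0],
           [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]

# Assumed fusion strategy: concatenate text and knowledge graph vectors.
fused = [t + g for t, g in zip(text_vecs, kg_vecs)]

labels = agglomerative(fused, 2)
# Scan candidate cluster counts and look for the point where the metric peaks.
scores = {k: silhouette(fused, agglomerative(fused, k)) for k in (2, 3, 4)}
best_k = max(scores, key=scores.get)
print(labels, scores, best_k)
```

Scanning the metric over a range of cluster counts mirrors the inflection-point analysis mentioned in the Results; on real atlas subgroup texts, the embeddings would come from an actual PLM and a trained knowledge graph embedding model rather than hand-written vectors.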