Citation: YUE Peng, WU Zhaoyan, SHANGGUAN Boyi. Design and Implementation of a Distributed Geospatial Data Storage Structure Based on Spark[J]. Geomatics and Information Science of Wuhan University, 2018, 43(12): 2295-2302. DOI: 10.13203/j.whugis20180034

Design and Implementation of a Distributed Geospatial Data Storage Structure Based on Spark

  • Abstract: With the rapid development of sensor web and Earth observation technologies, geospatial data has become an important part of big data, and traditional geospatial data storage and processing systems are increasingly unable to meet its requirements. Apache Spark, a unified analytics engine for large-scale data processing, can provide both management and processing capabilities for big geospatial data, and can serve as a foundational platform for cloud-based GIS, moving the conventional GIS kernel toward a distributed GIS kernel. Building on the data organization and computation models of Apache Spark, this paper couples Spark with the Apache HBase distributed database and presents the design and implementation of a distributed geospatial data storage structure and its object interfaces. In this architecture, a variable-length GeoHash index method is proposed to improve query performance for geospatial point, polyline, and polygon data, and a SpatialRDD is presented to manage and process, in a distributed manner, the geospatial data queried from Apache HBase. The GIS kernel of the architecture is implemented on top of the kernel of a Chinese GIS platform software. For the storage and querying of point, polyline, and polygon data, a series of comparative experiments against the traditional geospatial database system PostGIS is performed, and the results demonstrate the feasibility and efficiency of the proposed architecture.
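The abstract names a variable-length GeoHash index but does not describe how it is built. As a rough illustration only, the sketch below shows one plausible reading: a point is encoded at full precision, while a polyline or polygon is keyed by the longest GeoHash prefix whose cell still contains its bounding box. The function names `geohash_encode` and `variable_length_geohash`, the `max_length` parameter, and the bounding-box strategy are all assumptions made for this example, not the paper's method.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard GeoHash alphabet


def geohash_encode(lon, lat, length):
    """Encode a lon/lat pair as a standard GeoHash string of `length` characters."""
    lon_lo, lon_hi = -180.0, 180.0
    lat_lo, lat_hi = -90.0, 90.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate between axes, starting with longitude
    while len(chars) < length:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits, lon_lo = (bits << 1) | 1, mid
            else:
                bits, lon_hi = bits << 1, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = (bits << 1) | 1, mid
            else:
                bits, lat_hi = bits << 1, mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def variable_length_geohash(min_lon, min_lat, max_lon, max_lat, max_length=12):
    """Hypothetical helper: longest GeoHash prefix whose cell contains the box.

    GeoHash cells are axis-aligned rectangles, so a cell that contains both the
    south-west and north-east corners contains the whole bounding box. An empty
    string means no single cell short of the whole globe covers the box.
    """
    sw = geohash_encode(min_lon, min_lat, max_length)
    ne = geohash_encode(max_lon, max_lat, max_length)
    prefix = []
    for a, b in zip(sw, ne):
        if a != b:
            break
        prefix.append(a)
    return "".join(prefix)


if __name__ == "__main__":
    # A point is a degenerate box, so it keeps the full key length.
    print(variable_length_geohash(-5.6, 42.6, -5.6, 42.6, max_length=8))
    # A small polygon's bounding box gets a shorter key.
    print(variable_length_geohash(-5.61, 42.59, -5.59, 42.61))
```

Under this assumed scheme, small geometries receive long keys and large geometries short ones, so a key of this kind could serve as a row-key prefix in a store such as HBase, with a spatial query becoming a prefix scan followed by an exact geometry test. Geometries that straddle a major cell boundary collapse to very short keys, which is a known weakness of pure prefix schemes and would need separate handling.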

     
