Design and Implementation of a Distributed Geospatial Data Storage Structure Based on Spark
-
Abstract
In recent years, with the rapid development of sensor web and Earth observation technologies, geospatial data has become an important part of big data, and traditional geospatial data storage and processing systems are increasingly unable to meet the requirements of big geospatial data. Apache Spark, a unified analytics engine for large-scale data processing, can provide both management and processing capabilities for big geospatial data. Based on Apache Spark, a fundamental platform for cloud-based GIS can be developed, moving the conventional GIS kernel to a distributed GIS kernel in the era of cloud computing. Building on the data organization and computation models of Apache Spark, this paper couples Spark with the Apache HBase distributed database and presents the design and implementation of a distributed geospatial data storage and processing architecture that leverages the data management and computing paradigms of Apache Spark and Apache HBase. In this architecture, a variable-length GeoHash indexing method is proposed to improve query performance for geospatial point, polyline, and polygon data, and the SpatialRDD is presented to manage and process, in a distributed manner, the geospatial data queried from Apache HBase. The GIS kernel of the architecture is implemented on the basis of a Chinese-brand GIS software package. To evaluate the storage and processing of different kinds of geospatial data (point, polyline, and polygon), a series of comparative experiments against a traditional geospatial database, PostGIS, is performed, and the results demonstrate the applicability and efficiency of the approaches.
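To make the idea of a variable-length GeoHash key concrete, the following is a minimal sketch in Python. It is not the indexing method proposed in this paper, whose details are not given in the abstract. It shows the standard GeoHash encoding of a point and, as an illustrative assumption, derives a variable-length key for a polyline or polygon as the longest GeoHash prefix shared by the south-west and north-east corners of its bounding box. The function names `geohash_encode` and `covering_geohash` are hypothetical.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard GeoHash alphabet


def geohash_encode(lat, lon, precision=12):
    """Standard GeoHash: interleave longitude and latitude bisection bits."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    even = True  # even-numbered bits refine longitude, odd ones latitude
    while len(chars) < precision:
        if even:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        even = not even
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def covering_geohash(min_lat, min_lon, max_lat, max_lon, max_precision=12):
    """Assumed scheme: longest prefix shared by the two bounding-box corners.

    GeoHash cells are axis-aligned rectangles, so a cell that contains both
    corners contains the whole bounding box. Larger geometries therefore get
    shorter keys, and points keep the full precision.
    """
    sw = geohash_encode(min_lat, min_lon, max_precision)
    ne = geohash_encode(max_lat, max_lon, max_precision)
    prefix = []
    for a, b in zip(sw, ne):
        if a != b:
            break
        prefix.append(a)
    return "".join(prefix)


if __name__ == "__main__":
    print(geohash_encode(42.605, -5.603, 5))  # ezs42
    print(covering_geohash(42.60, -5.61, 42.61, -5.60))
```

Under this assumed scheme, a key such as the one returned by `covering_geohash` could serve as a row-key prefix in a store like Apache HBase, so that a spatial query becomes a prefix scan. A geometry that straddles a coarse cell boundary (for example, the prime meridian) yields an empty prefix, which is one reason a practical index would need additional handling beyond this sketch.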
-
-