VA-HBase: An Adaptive Distributed Management Scheme for Vector Data

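  • Abstract: Existing distributed spatial data management systems focus on indexing and querying discrete point sets, point sequences, and similar data types, and provide insufficient support for line and polygon objects. To address this problem, a vector-oriented adaptive management scheme based on the open-source HBase database (VA-HBase) is proposed. The scheme first applies a two-level index structure to vector objects such as points, lines, and polygons: the primary index adaptively partitions each object to find a suitable storage level, and the secondary index records, on a global coarse-grained grid, the minimum storage level of the objects covering each cell. A simplest-byte-stream method is then designed to shorten the encoding length and reduce storage space. Finally, an efficient range query algorithm is designed on this basis. Experimental results show that VA-HBase effectively compresses the storage space of all kinds of vector objects and maintains stable filtering performance during queries. Its query efficiency is about 2 to 10 times higher than that of GeoMesa and the other compared schemes, and it shows good scalability as the dataset grows.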

     

    Abstract:
    Objectives: With the rapid development of Earth observation networks, the volume of accumulated spatial data is growing explosively. However, current distributed spatial data management systems focus on discrete point sets (e.g., points of interest) or point sequences (e.g., vehicle trajectories) and provide insufficient support for complex polyline or polygon objects. To address this problem, we propose a vector-oriented adaptive management method based on HBase, named VA-HBase.
    Methods: First, a novel two-level spatial index is designed for complex vector objects. The primary index adaptively finds an appropriate storage level for each vector object according to its spatial characteristics, and encodes the object independently with a customized Z-curve encoding scheme. This scheme interleaves the spatial coordinates into a bit sequence following the Z-curve, and then converts that sequence into a byte code using the proposed simplest byte conversion scheme. The secondary index adopts fixed-level grid partitioning and precomputes statistics on storage levels to support efficient spatial queries later. A middle level is chosen for grid generation according to the level distribution of the stored objects, and the minimum storage level of the objects within each grid cell is recorded. Second, based on this two-level spatial index, an HBase storage schema with four tables is proposed: one metadata table, one primary index table, one secondary index table, and one raw object table. Finally, we design an efficient range query algorithm based on this method. By combining the adaptive-level primary index with the fixed-level secondary index, efficient parallel queries are implemented through HBase's filter mechanism.
    Results: Experiments on three real datasets show that: (1) VA-HBase achieves about 2 to 10 times higher query efficiency than GeoMesa and other related methods. (2) For complex polyline or polygon objects, the adaptive indexing of VA-HBase quickly filters out objects that are duplicates or fall outside the query rectangle, and its false-positive rate is much lower than that of other related methods. (3) As the input data size grows from 7 GB to 300 GB, query time stays at about 200 ms, showing that VA-HBase scales very well. (4) With the simplest byte encoding scheme, the index storage space for the various vector objects is compressed efficiently.
    Conclusions: VA-HBase supports the management of complex vector objects in distributed environments well, and maintains efficient and stable query performance on large-volume datasets.
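
    The two-level index described under Methods can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the normalization of coordinates to the unit square, the rule for choosing a storage level (the deepest level at which an object's bounding box fits in a single cell), and the value of MAX_LEVEL are all assumptions made for illustration. The paper's simplest byte conversion and its HBase table layout are not reproduced here.

```python
MAX_LEVEL = 16  # assumed maximum subdivision depth (hypothetical value)


def cell_index(v: float, level: int) -> int:
    """Map a coordinate in [0, 1] to its cell index on a 2**level grid."""
    n = 1 << level
    return min(int(v * n), n - 1)


def interleave(ix: int, iy: int, level: int) -> int:
    """Interleave the low `level` bits of ix and iy into a Z-order code."""
    code = 0
    for bit in range(level - 1, -1, -1):
        code = (code << 1) | ((ix >> bit) & 1)  # x bit first
        code = (code << 1) | ((iy >> bit) & 1)  # then y bit
    return code


def storage_level(xmin, ymin, xmax, ymax, max_level=MAX_LEVEL) -> int:
    """Assumed rule: deepest level at which the box lies in a single cell."""
    level = 0
    for cand in range(1, max_level + 1):
        if (cell_index(xmin, cand) != cell_index(xmax, cand)
                or cell_index(ymin, cand) != cell_index(ymax, cand)):
            break
        level = cand
    return level


def encode(xmin, ymin, xmax, ymax, max_level=MAX_LEVEL):
    """Primary-index key sketch: (storage level, Z-order code at that level)."""
    level = storage_level(xmin, ymin, xmax, ymax, max_level)
    code = interleave(cell_index(xmin, level), cell_index(ymin, level), level)
    return level, code


def build_secondary(boxes, grid_level, max_level=MAX_LEVEL):
    """Secondary-index sketch: for each cell of a fixed-level grid, record
    the minimum storage level among the objects overlapping that cell."""
    index = {}
    for xmin, ymin, xmax, ymax in boxes:
        level = storage_level(xmin, ymin, xmax, ymax, max_level)
        x0, x1 = cell_index(xmin, grid_level), cell_index(xmax, grid_level)
        y0, y1 = cell_index(ymin, grid_level), cell_index(ymax, grid_level)
        for ix in range(x0, x1 + 1):
            for iy in range(y0, y1 + 1):
                cell = interleave(ix, iy, grid_level)
                index[cell] = min(index.get(cell, level), level)
    return index
```

    Under these assumptions, a range query could consult the secondary index for the grid cells that the query rectangle overlaps, and scan the primary index only from each cell's recorded minimum level downward, rather than probing every level.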

     
