面向大规模空间数据流的分布式连接查询方法

Distributed Join Query Method for Large-Scale Spatial Data Streams

  • 摘要: 空间连接查询是处理和分析空间数据的基础操作之一。随着空间数据的爆发式增长,针对海量空间数据的连接查询技术备受瞩目。面向大规模历史数据的空间连接查询已被广泛研究,而受限于数据流的高速接入率与连接的实时性需求,目前面向数据流的分布式空间连接查询仍充满挑战。为此,本文面向大规模空间数据流设计了一种分布式连接查询处理框架。首先形式化定义了针对空间数据流连接查询处理问题,按照参与连接的数据形态细分为“流-表”和“流-流”两类连接。进一步分析分布式连接场景的共性问题,设计了“全局网格分区-局部空间索引”的两层处理框架,以支持空间流的分布式连接。在此基础上,针对不同连接场景分别设计了适应的连接策略:对于“流-表”连接,提出了基于两级R-tree拓扑关系判断优化算法;对于“流-流”连接,设计了一种顾及分区边界的数据冗余路由算法,以保证分区边界数据的正确连接;此外,针对间隔时间语义的缓存需求,提出了兼顾状态管理与检索效率的BinR-tree结构。大量实验结果表明,本文提出的空间数据连接方法具有良好的线性加速比,且相对于基线方法,连接查询效率得到了显著提升。

     

    Abstract: Objectives: Spatial join query is one of the basic operations for processing and analyzing spatial data. With the explosive growth of spatial data, join query technology for massive spatial data has attracted much attention. Although the existing research work has explored the join query of spatial data streams, it is still in its infancy. The processing of join query processing of spatial data streams is still insufficient in terms of systematization and universality. It is urgent to explore how to define general Spatial data streams connect query problems and provide systematic solutions. Method: After fully excavating the common real-time spatial data processing problems behind various practical application scenarios, this paper refers to the processing theory and methods of spatial data in the batch processing field, and considers the characteristics of long-term continuous operation of stream processing, and formalizes the problem of spatial data stream connection Two types of connection operators are defined, including "stream-table" connection and "stream-stream" connection; on this basis, a two-layer data index structure of global grid partition and local spatial index is proposed to support spatial Stream supports distributed connection; for the "stream-table" connection, this paper proposes two physical realization methods of spatial dimension tables, and designs a two-level R-tree-based topology relationship judgment optimization for the memory-based storage method Algorithm; For "stream-to-stream" joins, this paper proposes a partition boundary data redundancy algorithm to correctly implement cross-partition joins of partition boundary data. In addition, for the caching requirements of interval time semantics, a BinR-tree result that takes into account both management and retrieval efficiency is proposed. Results: (1) The “stream-table” spatial connection implemented in this paper can achieve a throughput of more than 60,000 under a single degree of parallelism, and the average overall delay is about 90 milliseconds. Compared with the native method, the average throughput is 9 times Boost, latency is reduced by an average of 20 milliseconds. (2) Same as the "stream-table" connection, the performance of the "stream-stream" connection method in this paper improves steadily with the increase of parallelism, and has good horizontal scalability. (3) In the experimental comparison between window connection and interval connection, with As the window increases, the connection latency increases, much higher than the stream interval connection. This part of the experiment also shows that the stream interval connection without bounded time semantics is more suitable for stream processing logic. (4) Mesh division has little effect on throughput, and its main function is to balance the load and ensure the spatial proximity of data. (5) When the bin size is smaller than the window size, the query will span multiple BinR-trees, which will affect the query efficiency. Properly increasing the bin size will improve the query efficiency; when the bin size is greater than or equal to the time interval, the query efficiency will not be greatly improved. Conclusions: A large number of experimental results show that the spatial data connection method proposed in this paper has a good linear speedup ratio, and compared with the baseline method, the connection query efficiency has been significantly improved.

     

/

返回文章
返回