Abstract:
Objectives: Spatial join query is one of the basic operations for processing and analyzing spatial data. With the explosive growth of spatial data, join query technology for massive spatial data has attracted much attention. Although the existing research work has explored the join query of spatial data streams, it is still in its infancy. The processing of join query processing of spatial data streams is still insufficient in terms of systematization and universality. It is urgent to explore how to define general Spatial data streams connect query problems and provide systematic solutions.
Method: After fully excavating the common real-time spatial data processing problems behind various practical application scenarios, this paper refers to the processing theory and methods of spatial data in the batch processing field, and considers the characteristics of long-term continuous operation of stream processing, and formalizes the problem of spatial data stream connection Two types of connection operators are defined, including "stream-table" connection and "stream-stream" connection; on this basis, a two-layer data index structure of global grid partition and local spatial index is proposed to support spatial Stream supports distributed connection; for the "stream-table" connection, this paper proposes two physical realization methods of spatial dimension tables, and designs a two-level R-tree-based topology relationship judgment optimization for the memory-based storage method Algorithm; For "stream-to-stream" joins, this paper proposes a partition boundary data redundancy algorithm to correctly implement cross-partition joins of partition boundary data. In addition, for the caching requirements of interval time semantics, a BinR-tree result that takes into account both management and retrieval efficiency is proposed.
Results: (1) The “stream-table” spatial connection implemented in this paper can achieve a throughput of more than 60,000 under a single degree of parallelism, and the average overall delay is about 90 milliseconds. Compared with the native method, the average throughput is 9 times Boost, latency is reduced by an average of 20 milliseconds. (2) Same as the "stream-table" connection, the performance of the "stream-stream" connection method in this paper improves steadily with the increase of parallelism, and has good horizontal scalability. (3) In the experimental comparison between window connection and interval connection, with As the window increases, the connection latency increases, much higher than the stream interval connection. This part of the experiment also shows that the stream interval connection without bounded time semantics is more suitable for stream processing logic. (4) Mesh division has little effect on throughput, and its main function is to balance the load and ensure the spatial proximity of data. (5) When the bin size is smaller than the window size, the query will span multiple BinR-trees, which will affect the query efficiency. Properly increasing the bin size will improve the query efficiency; when the bin size is greater than or equal to the time interval, the query efficiency will not be greatly improved.
Conclusions: A large number of experimental results show that the spatial data connection method proposed in this paper has a good linear speedup ratio, and compared with the baseline method, the connection query efficiency has been significantly improved.