面向分布式列式存储的轨迹大数据<i>k</i>近邻查询

余列冰; 向隆刚; 孙尚宇; 关雪峰; 吴华意

doi:10.13203/j.whugis20200136

摘要: 针对轨迹大数据的高效点-轨迹k近邻（point to trajectory k nearest neighbor, P2T_kNN）查询处理需求，提出了一种融合时空剖分和轨迹分段的轨迹组织方法，其核心思想是在对轨迹作时间剖分的基础上，利用离散全球网格系统（discrete global grid system, DGGS）在空间上进行再次剖分，从而利用两次剖分得到的时空单元编码来索引落入其中的轨迹片段。在此基础上利用分布式列式存储技术设计了面向轨迹大数据的P2T_kNN查询处理框架，提出了一种顾及轨迹数据空间分布的自适应空间单元搜索算法，即通过分析轨迹数据在给定时间约束下的空间分异特征，动态调整空间单元的搜索步长，从而提升了轨迹稀疏区域的处理效率。针对亿级轨迹的实验结果表明，该方法适用于轨迹大数据的P2T_kNN查询处理，在轨迹稠密与稀疏区域的平均查询响应时间均小于1 s。

Abstract:

Objectives With the advancement of mobile object tracking technologies, huge volumes of spatiotemporal data have been accumulated in the form of location trajectories. Such data bring us new opportunities and challenges in efficient trajectory retrieval. Among them, point to trajectory k nearest neighbor (P2T_kNN) query plays a pivotal role in many mobile object-oriented analyses and applications. It has been widely used in tracing the source of infectious diseases, recommending travel routes, and generating and updating local road networks based on trajectories.

Methods Aiming at the requirement of efficient P2T_kNN query processing for large-scale trajectory data, a trajectory organization method combining spatiotemporal segmentation and trajectory segmentation is proposed. The core idea is to use the discrete global grid system (DGGS) to segment the trajectory in spatial based on the previous time division. The spatiotemporal cell encoding is applied to index the trajectory segments falling in it. In order to realize efficient encoding of spatiotemporal cells, three concatenated encoding structures of spatiotemporal divide and conquer are proposed to solve the huge difference in dimension and scale between temporal dimension and spatial dimension. Based on above, a P2T_kNN query processing framework for distributed column-oriented storage is designed, and an adaptive spatial cell search algorithm that takes into account the spatiotemporal distribution of trajectory data is proposed. The step size for spatial cell searching is adaptively set according to the spatial stratified heterogeneity of the trajectory data under the given time constraint, which greatly improves the search efficiency of areas with sparse trajectory data.

Results Experimental results based on billion-level trajectories show that the method is suitable for P2T_kNN query processing of big trajectory data. (1) Different encoding structures are suitable for different query scenarios. Appropriate encoding structure can be used to make query response time less than 1 s. (2) Adaptive spatial cell search algorithm can significantly improve query efficiency, making the average query response delay of random one thousand queries less than 1 second on both dense and sparse areas. (3) The spatial cell?s k rate has an important influence on the query. After the k rate reaches 0.5, the query response time tends to balance.

Conclutions We propose a nearest neighbor query processing framework and its search algorithm for distributed column-oriented storage. Firstly, the framework organizes the trajectory in segments based on the spatiotemporal cells, and uses concatenated spatiotemporal encoding to realize spatiotemporal cluster storage of trajectory segments.Secondly, an algorithm that can perceive the spatial differentiation characteristics of trajectory data under a given time constraint and dynamically adjust the search step size of the spatial cell is designed, which greatly improves the processing efficiency of the data sparse area. The adaptive search strategy is applicable to any DGGS based on recursive subdivision, and it has strong scalability and can seamlessly connect with other subdivision frameworks, such as S2 Geometry, H3, etc.

面向分布式列式存储的轨迹大数据k近邻查询

kNN Query Processing for Trajectory Big Data Based on Distributed Column-Oriented Storage