Hilbert Curve and Cassandra Based Indexing and Storing Approach for Large-Scale Spatiotemporal Data

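  • Summary: As more and more trajectory data are recorded, the massive and complex data arising in various application scenarios require efficient storage and indexing. Traditional relational databases struggle to meet the storage, scaling, and specialized query requirements of massive trajectory data, whereas non-relational databases, which are simple to scale, fast to read and write, and low in cost, offer a feasible solution. We design and implement a dimensionality-reduction, key-value storage, and indexing method based on the Cassandra database that manages spatiotemporal trajectory data efficiently. To further improve efficiency, Hilbert curve encoding is incorporated to partition space into small cells and to map trajectory data into those cells. Exploiting the principle of spatiotemporal locality, corresponding partition keys and clustering keys are designed and implemented for trajectory data in different application scenarios, so that trajectory objects that are close in space and time are stored near one another, which makes queries more effective. Experimental results from real application scenarios show that the proposed method effectively supports the storage and indexing of massive trajectory data, and that it outperforms other spatiotemporal big-data indexing and query methods in data insertion, querying, and scalability of the storage structure.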
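
As a minimal sketch of the dimensionality reduction described above, the following Python code maps a two-dimensional grid cell to its position along a Hilbert curve, so that cells that are close in space tend to receive nearby one-dimensional keys. The function names `hilbert_index` and `cell_id`, the grid order, and the flat longitude/latitude grid are assumptions made for illustration; they are not the paper's implementation, which relies on the S2 Geometry Library rather than a flat grid.

```python
def hilbert_index(order: int, x: int, y: int) -> int:
    """Return the Hilbert-curve distance of cell (x, y) on a 2**order by 2**order grid."""
    n = 1 << order
    if not (0 <= x < n and 0 <= y < n):
        raise ValueError("cell coordinates fall outside the grid")
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        # Each quadrant contributes s*s cells; (3*rx) ^ ry is the quadrant's visit order.
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def cell_id(lon: float, lat: float, order: int = 16) -> int:
    """Quantize a lon/lat pair onto a flat grid (an assumed stand-in for S2 cells)
    and return the Hilbert index of the enclosing cell."""
    n = 1 << order
    x = min(int((lon + 180.0) / 360.0 * n), n - 1)
    y = min(int((lat + 90.0) / 180.0 * n), n - 1)
    return hilbert_index(order, x, y)


if __name__ == "__main__":
    # Two nearby points (illustrative coordinates) and the cell keys they receive.
    print(cell_id(114.30, 30.60, order=8), cell_id(114.31, 30.61, order=8))
```

A coarser `order` gives larger cells and therefore fewer, larger groups of points per key; the appropriate granularity depends on the data density and is not specified here.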

     

    Abstract:
      Objectives  Real-time spatiotemporal data are being acquired at a rapidly growing rate for applications such as smart cities and real-time air-quality monitoring, and traditional database technologies cannot meet the resulting demands for large-scale indexing, querying, and storage. NoSQL databases, which are scalable and offer fast input/output, are a viable alternative.
      Methods  We propose an approach based on the Hilbert curve and Cassandra for efficiently indexing and storing large-scale spatiotemporal datasets, aiming to provide an effective framework for processing, querying, and analyzing large amounts of data with spatial and temporal features. Vehicle trajectory datasets, for example, contain valuable spatial and temporal features that are used in real-world applications. The collected spatiotemporal datasets are preprocessed to fit the proposed structures for different applications. Specifically, two types of query are common in practice: spatiotemporal range queries and queries by vehicle ID. Two corresponding indexing structures are designed and implemented to serve these requests. The S2 Geometry Library, open-sourced by Google, is used to divide the Earth's surface into grid cells, and the data points that fall in a cell are assigned that cell's ID as their key. The keys and columns are designed, by applying the Hilbert curve and Cassandra techniques, so that the resulting structures physically store spatially neighboring data points close to each other, which makes them better suited to large-scale spatiotemporal querying and analysis.
      Results  Datasets acquired from real applications are used in computational experiments to validate the efficiency of the proposed approach. Query efficiency and the time needed to store large amounts of spatiotemporal data are measured and benchmarked against several existing database technologies.
      Conclusions  The computational experiments show that the proposed approach outperforms the existing methods: the time required to store (insert) data is reduced by a factor of 6, and the time needed to query data is reduced by a factor of at least 10. The efficiency of the proposed method is further validated by applying it to query the trajectories of vehicles that gather real-time air-quality data.
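
The two index structures mentioned in the Methods can be illustrated with a small in-memory emulation. In Cassandra, the partition key decides which partition a row belongs to, and the clustering key decides the sort order of rows inside that partition. The sketch below mimics that layout with Python dictionaries and sorted lists. The concrete key choices (cell ID plus an hourly time bucket for range queries, vehicle ID for per-vehicle queries), the class names, and the `BUCKET_SECONDS` value are all assumptions for illustration; the abstract does not state the exact keys used.

```python
from bisect import bisect_left, insort
from collections import defaultdict

BUCKET_SECONDS = 3600  # assumed width of a time bucket (one hour)


class RangeTable:
    """Emulates a table partitioned by (cell, time bucket) and clustered by
    (timestamp, vehicle). The key layout is hypothetical."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def insert(self, cell: int, ts: int, vehicle: str, value: float) -> None:
        # Rows in one partition stay sorted by timestamp, then vehicle.
        insort(self.partitions[(cell, ts // BUCKET_SECONDS)], (ts, vehicle, value))

    def query(self, cells, t0: int, t1: int):
        """Return rows in the given cells with t0 <= timestamp <= t1."""
        out = []
        for cell in cells:
            for bucket in range(t0 // BUCKET_SECONDS, t1 // BUCKET_SECONDS + 1):
                rows = self.partitions.get((cell, bucket), [])
                lo = bisect_left(rows, (t0,))
                hi = bisect_left(rows, (t1 + 1,))
                out.extend(rows[lo:hi])
        return out


class VehicleTable:
    """Emulates a table partitioned by vehicle ID and clustered by timestamp.
    The key layout is hypothetical."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def insert(self, cell: int, ts: int, vehicle: str, value: float) -> None:
        insort(self.partitions[vehicle], (ts, cell, value))

    def query(self, vehicle: str, t0: int, t1: int):
        """Return one vehicle's rows with t0 <= timestamp <= t1, in time order."""
        rows = self.partitions.get(vehicle, [])
        return rows[bisect_left(rows, (t0,)):bisect_left(rows, (t1 + 1,))]
```

With this layout, a spatiotemporal range query first determines which cells cover the query region and then reads only the matching partitions, while a per-vehicle query reads a single partition that is already sorted by time. Writing each record to both tables trades extra storage for reads that touch a single partition per query type.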

     
