一种Haar小波概要的流数据压缩方法

李水泉; 乐阳; 李清泉; 庄严

doi:10.13203/j.whugis20190309

摘要: 传感器和物联网技术的广泛使用催生了大量流数据。面对海量、连续且实时到达的流数据，其存储、查询以及后续大量数据的连续使用和分析是迫切需要解决的问题, 而流数据压缩是其中一个有效方法。设计了一种数据概要算法，贪心的哈尔小波概要(greedy-Haar-synopses，GH-Synopses)，对流数据进行有效的压缩。首先，设计了一种可以实时处理连续数据的概要数据结构; 然后，在Haar小波变换的基础上用一种贪心策略生成尽可能少的概要数据，从而实现有损的流数据实时压缩。相比于现有的Haar小波类数据概要算法，GH-Synopses能够兼顾压缩率和实时性，并能够控制单个数据的误差界限，可以适用更广泛的场景。利用深圳市路段行驶速度数据进行了验证，结果表明，GH-Synopses算法所生成的概要数据数量少，一般压缩率可达到3%~7%，且能高效重构，重构后的每一个数据误差均不会超过设定的误差预值，是一种有效且高效的流数据压缩方法。

Abstract:

Objectives With the rapid development of sensors and the Internet of Things technology, a large number and wide range of sensors have been used, which has led to a huge amount of streaming data. These massive, continuous and real-time streaming data bring huge difficulties to data storage, query and analysis, therefore, the problems are urgent to be solved. Streaming data compression is one of the effective methods, and we design a data synopses algorithm with greedy-Haar-synopses (GH-Synopses) to compress streaming data efficiently.

Methods The core of GH-Synopses algorithm is to design a synopses data structure that can process continuous data in real time. The generation process of the compressed data is the same as the process of generating the synopses data structure. Under the prerequisites of the maximum absolute error preset that we set in advance, our algorithm is based on the Haar wavelet transform and uses a greedy strategy that is used to generate as little synopses data as possible to achieve real-time compression of the lossy streaming data.

Results Compared with several existing Haar wavelet-like data synopses algorithms, GH-Synopses algorithm can take into account both compression rate and real-time performance, and can control the error margin of a single data, which can be applied to a wider range of scenarios. The algorithm uses Shenzhen's road link travel speed data from January 5, 2015 to January 30, 2015 to verify from three aspects: Compression quality, execution time efficiency, and reconstruction accuracy. In addition, GH-Synopses is compared with several other more classic algorithms. The results show that, in terms of compression quality, GH-Synopses algorithm generates a small amount of synopses data, and the general compression rate for road link travel speed data can reach about 3% to 7%, so it can be well used for road link travel speed data compression. In terms of execution time efficiency, the running time of GH-Synopses algorithm shows a linear trend with the size of the data set, i.e., the data set increases every 2 times, and the running time also increases about 2 times.

Conclusions GH-Synopses can process the real-time stream data effectively, and the error of each reconstructed data will not exceed the maximum absolute error preset that we set in advance. At the same time, the maximum absolute error preset is closest to the original data when the value is equal to 1 for road link travel speed data. Therefore, GH-Synopses is an effective and efficient compression method for data storage.

一种Haar小波概要的流数据压缩方法

A Streaming Data Compression Method Based on Haar Wavelet Synopses