Abstract:
Objectives With the rapid development of sensor and Internet of Things technologies, sensors are now deployed in large numbers and across a wide range of applications, producing huge volumes of streaming data. This massive, continuous, real-time data makes storage, query, and analysis difficult, and these problems urgently need to be solved. Compressing the streaming data is one effective remedy, so we design a data synopsis algorithm, greedy-Haar-synopses (GH-Synopses), to compress streaming data efficiently.
Methods The core of the GH-Synopses algorithm is a synopsis data structure that can process continuous data in real time. The compressed data is produced as this structure is built, so generating the synopsis and compressing the data are the same process. Given a preset maximum absolute error, the algorithm applies the Haar wavelet transform together with a greedy strategy to generate as little synopsis data as possible, achieving real-time lossy compression of streaming data.
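The idea of combining a Haar wavelet transform with a greedy, error-bounded selection of coefficients can be illustrated with a minimal sketch. This is not the GH-Synopses algorithm itself, whose data structure is not described in this abstract. It is a simplified batch version under stated assumptions: the input block has a power-of-two length, the function names (`haar_decompose`, `haar_reconstruct`, `greedy_synopsis`, `rebuild`) are hypothetical, and each candidate coefficient is tested by a full reconstruction, which costs quadratic time and is therefore not suitable for real-time use.

```python
from typing import Dict, List


def haar_decompose(data: List[float]) -> List[float]:
    """Unnormalized Haar transform: [overall average, details coarse-to-fine]."""
    n = len(data)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("length must be a power of two (assumption of this sketch)")
    coeffs: List[float] = []
    current = list(data)
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        averages = [(current[i] + current[i + 1]) / 2 for i in pairs]
        details = [(current[i] - current[i + 1]) / 2 for i in pairs]
        coeffs = details + coeffs  # finer details end up at the back
        current = averages
    return current + coeffs


def haar_reconstruct(coeffs: List[float]) -> List[float]:
    """Invert haar_decompose."""
    current = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(current)]
        pos += len(current)
        expanded: List[float] = []
        for avg, det in zip(current, details):
            expanded.extend([avg + det, avg - det])
        current = expanded
    return current


def greedy_synopsis(data: List[float], max_abs_error: float) -> Dict[int, float]:
    """Greedily zero out coefficients, smallest magnitude first, as long as
    every reconstructed value stays within max_abs_error of the original."""
    kept = haar_decompose(data)
    for i in sorted(range(len(kept)), key=lambda k: abs(kept[k])):
        saved = kept[i]
        kept[i] = 0.0
        recon = haar_reconstruct(kept)
        if max(abs(r - x) for r, x in zip(recon, data)) > max_abs_error:
            kept[i] = saved  # dropping this coefficient breaks the bound
    # The synopsis stores only the non-zero coefficients and their positions.
    return {i: c for i, c in enumerate(kept) if c != 0.0}


def rebuild(synopsis: Dict[int, float], n: int) -> List[float]:
    """Reconstruct n values from a sparse synopsis."""
    coeffs = [0.0] * n
    for i, c in synopsis.items():
        coeffs[i] = c
    return haar_reconstruct(coeffs)


# Illustrative values only, not taken from the Shenzhen data set.
speeds = [50.0, 52.0, 48.0, 47.0, 30.0, 31.0, 60.0, 58.0]
synopsis = greedy_synopsis(speeds, max_abs_error=1.0)
approx = rebuild(synopsis, len(speeds))
print(len(synopsis), "of", len(speeds), "coefficients kept")
print("max abs error:", max(abs(a - s) for a, s in zip(approx, speeds)))
```

Because a coefficient is dropped only when the reconstruction still satisfies the bound, the error of every individual value is controlled by construction, which is the property the abstract emphasizes. A streaming variant would have to maintain the error bounds incrementally instead of reconstructing the whole block for each candidate.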
Results Compared with several existing Haar wavelet-like data synopsis algorithms, GH-Synopses balances compression rate against real-time performance and bounds the error of each individual data value, so it applies to a wider range of scenarios. We evaluate the algorithm on road link travel speed data from Shenzhen, covering January 5 to January 30, 2015, in three respects: compression quality, execution time efficiency, and reconstruction accuracy. We also compare GH-Synopses with several classic algorithms. In terms of compression quality, GH-Synopses generates only a small amount of synopsis data: the compression rate for road link travel speed data is typically about 3% to 7%, so the algorithm is well suited to compressing such data. In terms of execution time efficiency, the running time grows roughly linearly with the size of the data set, i.e., each time the data set doubles, the running time also roughly doubles.
Conclusions GH-Synopses processes real-time streaming data effectively, and the error of each reconstructed value never exceeds the preset maximum absolute error. For road link travel speed data, the reconstructed data is closest to the original when the maximum absolute error is set to 1. GH-Synopses is therefore an effective and efficient compression method for data storage.