Abstract:
Objectives With advances in transmission technology, digital devices, and display technology, video has rapidly become one of the most popular entertainment media. Understanding large-scale video data, with its rich and complex semantics, requires effective classification of the human actions and events it contains. Many current video classification algorithms need a large number of labeled samples to train a model that reaches high classification accuracy. Acquiring labeled data at that scale remains a problem: it is difficult to collect enough training data for every category, and annotating the data one by one is time-consuming and laborious.
Methods To address the shortage of training samples caused by the complexity of the video classification task and the difficulty of obtaining labeled data, we use the known semantic information of video labels as prior knowledge, use 3D DenseNet as the backbone network, and classify videos with zero-shot learning. The dense connections characteristic of DenseNet allow deep features to be extracted while gradients propagate effectively, without producing a large amount of redundancy and wasted computation. Replacing the original 2D convolutions of DenseNet with 3D convolutions allows the spatio-temporal information of the video to be extracted directly. Existing zero-shot video classification methods classify by dot product, which tends to produce high-variance neuron outputs and makes the model very sensitive to changes in the input distribution. To solve this problem, we propose a 3D DenseNet for zero-shot video classification (DZSVC) algorithm that combines 3D DenseNet with cosine similarity classification. Replacing dot-product classification with cosine similarity classification makes the model more robust to the magnitude of its input. In addition, owing to the dense connectivity of 3D DenseNet, DZSVC can extract richer spatio-temporal information from video data and map the extracted feature vectors into a common space.
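The contrast between dot-product scoring and cosine similarity scoring can be illustrated with a minimal Python sketch. This is not the DZSVC implementation: the function names, the feature vector, and the label embeddings are hypothetical, and short lists stand in for the 3D DenseNet video features and the semantic label vectors once both are mapped into the same space. The sketch only shows that rescaling the input changes the dot-product scores but leaves the cosine scores unchanged.

```python
import math

def dot(u, v):
    """Unnormalized dot-product score; grows with the magnitude of u and v."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity: the dot product divided by both vector norms."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm > 0.0 else 0.0

def classify(feature, label_embeddings, score=cosine):
    """Return the label whose embedding scores highest against the feature."""
    return max(label_embeddings,
               key=lambda name: score(feature, label_embeddings[name]))

# Hypothetical values, chosen only for illustration (not taken from the paper).
label_embeddings = {
    "archery": [0.9, 0.1, 0.0],
    "surfing": [0.1, 0.2, 0.9],
}
feature = [1.0, 0.2, 0.0]             # stand-in for a projected video feature
scaled = [10.0 * x for x in feature]  # same direction, ten times the magnitude

# Scores for the original and the rescaled input under each scoring rule.
dot_scores = {n: (dot(feature, e), dot(scaled, e))
              for n, e in label_embeddings.items()}
cos_scores = {n: (cosine(feature, e), cosine(scaled, e))
              for n, e in label_embeddings.items()}

predicted = classify(feature, label_embeddings)
```

In this toy setting each dot-product score is multiplied by ten when the input is rescaled, while each cosine score stays the same, which is the sense in which cosine similarity is insensitive to input magnitude.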
Results The proposed method achieves accuracies of 32.9% and 20.2% on UCF101 and HMDB51, respectively, and 41.4% and 23.7% on UCF50 and HMDB25, respectively.
Conclusions Multiple sets of experiments demonstrate the effectiveness of the proposed method. Compared with existing zero-shot video classification methods, it classifies videos more effectively.