Zero-Shot Video Classification Combined with 3D DenseNet

  • Abstract: The dot-product classification used in zero-shot video classification (ZSVC) tends to produce neurons with high variance, making the model very sensitive to shifts in the input distribution. To address this problem, a 3D DenseNet for zero-shot video classification (DZSVC) algorithm is proposed, combining a three-dimensional dense network (3D DenseNet) with cosine similarity. Replacing the dot-product classifier with a cosine-similarity classifier bounds the variance, making the model more robust to inputs of different magnitudes. The algorithm first feeds the video data into the 3D DenseNet, exploits its dense connectivity to extract richer temporal and spatial information, and maps the extracted feature vectors into a common space; videos are then classified by cosine similarity. The accuracies are 32.9% and 20.2% on the UCF101 and HMDB51 datasets, and 41.4% and 23.7% on the UCF50 and HMDB25 datasets, respectively. The experimental results show that the proposed algorithm achieves good classification performance.
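The key difference between the two classification schemes described in the abstract can be sketched with a minimal NumPy example (the feature and weight values here are random stand-ins, not the paper's actual model): dot-product logits scale with the magnitude of the input feature, while cosine-similarity logits are bounded in [-1, 1] and unchanged by rescaling.

```python
import numpy as np

def dot_logits(x, W):
    # Dot-product classification: logits scale with the input magnitude.
    return x @ W.T

def cosine_logits(x, W):
    # Cosine-similarity classification: the feature vector and each class
    # weight vector are L2-normalised, so every logit lies in [-1, 1]
    # regardless of the input magnitude.
    xn = x / np.linalg.norm(x, axis=-1, keepdims=True)
    Wn = W / np.linalg.norm(W, axis=-1, keepdims=True)
    return xn @ Wn.T

rng = np.random.default_rng(0)
x = rng.standard_normal(8)        # a toy feature vector
W = rng.standard_normal((5, 8))   # toy weights for 5 classes

# Scaling the input by 10x scales dot-product logits by 10x ...
assert np.allclose(dot_logits(10 * x, W), 10 * dot_logits(x, W))
# ... but leaves cosine-similarity logits unchanged.
assert np.allclose(cosine_logits(10 * x, W), cosine_logits(x, W))
```

This boundedness is what the abstract refers to as controlling the range of the variance: no single neuron's output can grow with the input scale.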


    Abstract:
      Objectives  With the advancement of transmission technology, digital devices and display technology, video has rapidly become one of the most popular entertainment media. In order to understand large-scale data such as video, with its rich and complex semantics, it is necessary to effectively classify the human behaviors and events within it. At present, many video classification algorithms require a large number of labeled samples to train the model in order to achieve high classification accuracy. However, acquiring massive amounts of labeled data remains problematic: it is difficult to collect a large amount of training data for each category, and annotating these data one by one is time-consuming and laborious.
      Methods  To address the shortage of training samples caused by the high complexity of the video classification task and the difficulty of obtaining labeled data, we use the known semantic information of video labels as a prior, adopt 3D DenseNet as the backbone network, and classify videos with a zero-shot learning approach. The dense connections unique to the dense network allow deep information to be obtained while ensuring effective gradient propagation, without generating large amounts of redundancy and wasted computation. Replacing the original 2D convolutions of the dense network with 3D convolutions allows the spatio-temporal information of the video to be extracted directly. Existing zero-shot video classification methods adopt dot-product classification, which easily drives neurons to high variance, making the model very sensitive to changes in the input distribution. To solve this problem, a 3D DenseNet for zero-shot video classification (DZSVC) algorithm is proposed by combining 3D DenseNet with a cosine-similarity classifier. The cosine-similarity classifier replaces the dot-product classifier, making the model more robust to inputs of different magnitudes. In addition, owing to the dense connectivity of 3D DenseNet, the DZSVC algorithm can extract richer temporal and spatial information from video data and map the extracted feature vectors into the common space.
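The zero-shot inference step described above can be sketched as follows. This is a simplified illustration, not the paper's implementation: the random vectors stand in for the pooled 3D DenseNet feature, the learned projection into the common space, and the semantic (e.g. word-embedding) vectors of classes unseen during training; the dimensions (256, 300) are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins: in the paper these come from the 3D DenseNet
# backbone, a learned projection, and label word embeddings.
video_feature = rng.standard_normal(256)           # pooled spatio-temporal feature
P = rng.standard_normal((300, 256)) * 0.05         # projection into the common space
label_embeddings = rng.standard_normal((10, 300))  # semantic vectors of 10 unseen classes

def l2norm(v):
    # Normalise along the last axis so dot products become cosine similarities.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Project the visual feature into the common semantic space, score every
# unseen class by cosine similarity, and pick the best match.
z = l2norm(P @ video_feature)
scores = l2norm(label_embeddings) @ z
predicted_class = int(np.argmax(scores))
```

Because both the projected feature and the label embeddings are unit vectors, every score is a cosine similarity in [-1, 1], which is what makes classification over unseen classes comparable across inputs of different magnitudes.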
      Results  The accuracy rates on UCF101 and HMDB51 are 32.9% and 20.2%, respectively. The accuracy rates on UCF50 and HMDB25 are 41.4% and 23.7%, respectively.
      Conclusions  The effectiveness of the proposed method is demonstrated through multiple sets of experiments. Compared with existing zero-shot video classification methods, the proposed method can be applied to video classification more effectively.
