Deep Unsupervised Key Frame Extraction for Efficient Video Classification
- 1. ETH
- 2. University of Trento, Italy
- 3. Guangdong University of Petrochemical Technology, China
Description
Video processing and analysis have become urgent tasks, since a huge number of videos are uploaded to online platforms (e.g., YouTube, Hulu) every day. Extracting representative key frames from videos is important in video processing and analysis, since it greatly reduces the required computing resources and time. Although great progress has been made recently, large-scale video classification remains an open problem, as existing methods have not balanced performance and efficiency well. To tackle this problem, this work presents an unsupervised method to retrieve key frames that combines a Convolutional Neural Network (CNN) with Temporal Segment Density Peaks Clustering (TSDPC). The proposed TSDPC is a generic and powerful framework with two advantages over previous work: it determines the number of key frames automatically, and it preserves the temporal information of the video. It thereby improves the efficiency of video classification. Furthermore, a Long Short-Term Memory network (LSTM) is added on top of the CNN to further improve classification performance, and a weighted fusion strategy over different input networks is presented to boost performance further. By optimizing video classification and key frame extraction jointly, we achieve better classification performance and higher efficiency. We evaluate our method on two popular datasets (HMDB51 and UCF101), and the experimental results consistently demonstrate that our strategy achieves competitive performance and efficiency compared with state-of-the-art approaches.
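To illustrate how density-peaks clustering can fix the number of key frames automatically, here is a minimal sketch in the spirit of the classic density-peaks algorithm: frames whose local density ρ and distance-to-denser-point δ are both large become cluster centers, so thresholding the product ρ·δ selects the key frames without a preset count. The function name, the cutoff heuristic, and the threshold are illustrative assumptions, not the paper's exact TSDPC procedure; in particular, the temporal-segment part (clustering within video segments to preserve temporal order) is omitted here.

```python
import numpy as np

def density_peaks_keyframes(features, d_c=None, gamma_thresh=None):
    """Select key-frame indices from per-frame CNN features via
    density-peaks clustering. `features` is an (n_frames, dim) array.
    Hypothetical sketch: cutoff and threshold heuristics are assumptions."""
    n = len(features)
    # Pairwise Euclidean distances between frame features.
    dist = np.linalg.norm(features[:, None] - features[None, :], axis=-1)
    if d_c is None:
        # Heuristic cutoff distance: small quantile of off-diagonal distances.
        d_c = np.quantile(dist[np.triu_indices(n, k=1)], 0.02) + 1e-8
    # Local density rho: Gaussian-kernel count of nearby frames (self excluded).
    rho = np.exp(-(dist / d_c) ** 2).sum(axis=1) - 1.0
    # delta: distance to the nearest frame with strictly higher density;
    # the globally densest frame gets its maximum distance instead.
    delta = np.zeros(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        delta[i] = dist[i, higher].min() if len(higher) else dist[i].max()
    # Frames with anomalously large rho * delta are cluster centers, so the
    # number of key frames falls out of the threshold automatically.
    gamma = rho * delta
    if gamma_thresh is None:
        gamma_thresh = gamma.mean() + gamma.std()
    return np.sort(np.where(gamma > gamma_thresh)[0])
```

For example, running this on features that form a few tight groups (e.g., shots of a video) returns one representative frame per group, with the count emerging from the data rather than being chosen in advance.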