Learning Representative Temporal Features for Action Recognition
In this paper we present a novel video classification methodology that aims to recognize different categories of third-person videos efficiently. The idea is to track motion in videos and extract both short-term and long-term features from the resulting motion time series by training a multi-channel one-dimensional Convolutional Neural Network (1D-CNN). A key advantage of our method is that we learn representative temporal features only along the temporal dimension; spatial features are extracted with pre-trained networks that have already been trained on large-scale image recognition datasets. Learning features along a single dimension reduces the number of computations significantly and makes our method applicable even to smaller datasets. Furthermore, we show not only that our method reaches state-of-the-art results on two public datasets, UCF11 and jHMDB, but also that it yields a strong feature vector representation that is much more efficient than those produced by other methods.
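To illustrate the idea of learning features only along the temporal dimension, the following is a minimal sketch of a multi-channel 1D-CNN applied to motion time series. It is written in PyTorch for concreteness; the channel count, kernel sizes, and number of classes are illustrative assumptions and are not taken from the paper.

```python
import torch
import torch.nn as nn

class TemporalConvNet(nn.Module):
    """Sketch of a multi-channel 1D-CNN over motion time series.

    Hyperparameters below are hypothetical, not the authors' settings.
    """

    def __init__(self, in_channels=2, num_classes=11):
        super().__init__()
        self.features = nn.Sequential(
            # Small kernel: short-term motion patterns.
            nn.Conv1d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(2),
            # Stacked, larger kernel: longer-term dynamics.
            nn.Conv1d(32, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            # Pool only over the temporal dimension.
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):
        # x: (batch, channels, time), e.g. x/y displacements of tracked motion.
        h = self.features(x).squeeze(-1)
        return self.classifier(h)

# Example: a batch of 8 motion time series with 2 channels and 64 time steps.
model = TemporalConvNet()
logits = model(torch.randn(8, 2, 64))
print(logits.shape)  # torch.Size([8, 11])
```

Because all convolutions are one-dimensional, the parameter count and compute stay small compared to 2D or 3D spatio-temporal convolutions, which is the efficiency argument made above.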