We live in an era in which technology is transforming lives and changing the way the world works. This change has been made possible by advances in artificial intelligence, which have turned mere computing machines into intelligent ones. Recently, human action recognition from videos has attracted considerable attention from the research community because its fields of application are vast. Video analytics, human-computer interaction, robotics, video indexing, and visual surveillance are just a few of them. However, action recognition remains a hard problem, in part because realistic data samples are difficult to select and obtain. Deep learning, especially the convolutional neural network (CNN), has achieved remarkable success in action recognition. The problem with a CNN is that it requires a large number of training samples to produce accurate results, whereas good sample data are hard to get. Deep manifold learning (DML) has been introduced as a solution to this problem; it modifies the learning process.
There are two types of video features available for action recognition.
Local features include space-time interest points, dense trajectories, and improved trajectories. Methods for deep-learned features include restricted Boltzmann machines (RBMs), CNNs, 3-D CNNs, and deep CNNs. Deep learning methods aim to learn semantic representations automatically from raw videos by using a deep neural network (DNN). Most current deep-learning-based action recognition methods ignore the relationship between different training samples. One modified CNN architecture that can extract features from videos is the two-stream CNN, which matches the state-of-the-art performance of improved trajectories. It is composed of two networks: a spatial net and a temporal net. The spatial net primarily captures discriminative appearance features for action understanding, whereas the temporal net aims to learn effective motion features. However, unlike in image classification tasks, these deep-learning-based methods fail to outperform earlier hand-crafted features. Deep learning methods require a large number of labeled videos for training, whereas most available data sets are relatively small. In addition to recognition tasks, CNNs have also been used in 3-D image restoration problems. Deep learning methods have also achieved success in skeleton-based action recognition, which represents actions by the trajectories of skeleton joints. Temporal representations of body parts can be modeled by bidirectional recurrent neural networks (BRNNs). To overcome the vanishing gradient problem during training, long short-term memory (LSTM) neurons can be adopted in the last BRNN layer.
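The two-stream design described above can be illustrated with a minimal sketch of how the outputs of the spatial and temporal nets might be combined. The text does not say how the two streams are fused, so the weighted averaging of class probabilities used here is an assumption, and the function name `fuse_two_stream`, the `temporal_weight` parameter, and the toy scores are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def fuse_two_stream(spatial_logits, temporal_logits, temporal_weight=0.5):
    """Weighted average of per-class probabilities from the two streams.

    Assumption: late fusion by averaging; the source does not specify the
    fusion rule.
    """
    p_spatial = softmax(spatial_logits)
    p_temporal = softmax(temporal_logits)
    fused = (1.0 - temporal_weight) * p_spatial + temporal_weight * p_temporal
    return fused, fused.argmax(axis=1)

# Toy class scores for two videos and three action classes (made-up numbers).
spatial_logits = np.array([[2.0, 1.0, 0.1], [0.2, 0.1, 0.3]])
temporal_logits = np.array([[0.5, 2.5, 0.2], [0.1, 0.1, 3.0]])
probs, labels = fuse_two_stream(spatial_logits, temporal_logits)
```

In this toy case the appearance stream alone would pick class 0 for the first video, but the stronger motion evidence moves the fused decision to class 1.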
CNNs belong to the class of biologically inspired models for visual recognition. Motivated by the organization of the visual cortex, a similar model called HMAX, which constructs a hierarchy of increasingly complex features, has been developed for visual object recognition. A major difference between CNN- and HMAX-based models is that CNNs are fully trainable systems in which all parameters are adjusted based on training data, whereas all modules in HMAX consist of hard-coded parameters. Another deep learning method, called SESM, amounts to a type of encoder/decoder architecture. Its main motivation is to arrive at a well-principled method for unsupervised training. The RBM can be viewed as one type of encoder whose elements are stochastic binary variables.
In addition to the CNN approach, there are some other popular designs, such as deep belief networks (DBNs), deep Boltzmann machines (DBMs), and stacked denoising autoencoders.
CNN Architectures
The first task is to incorporate the structural manifold information among different action videos into a CNN. Any type of CNN model, such as CaffeNet, ResNet-50, or GoogLeNet, can be adopted for this manifold learning framework. Among the recently proposed networks that achieved good classification performance on ImageNet, GoogLeNet is chosen because it has far fewer parameters.
Suppose that there are $N$ sample action videos with low-level features $X = \{x_1, x_2, \ldots, x_N\}$; we construct a CNN for each sample. The convolutional feature map in the $(l+1)$th layer of $x_i$ is $C_i^{l+1}$, which can be described by the neuron activation function as follows:
$$C_i^{l+1} = h_{W,b}\left(C_i^{l}\right)$$
Manifold Regularizer
The incorporation of the manifold changes the expression of the neuron activation function by adding a regularizer. The added manifold regularizer can be seen as a newly defined layer whose input includes $h_{W,b}(C_1^{l}), \ldots, h_{W,b}(C_N^{l})$, the label information, and the manifold parameters. Through this regularizer, the manifold of layer $l$ is embedded into the outputs of layer $l+1$. By comparison, batch normalization has a drawback: normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity. To address this problem, a pair of parameters that scale and shift the normalized values is introduced for each activation. Introducing more parameters may increase the risk of overfitting.
The manifold regularizer does not suffer from this problem, because the manifold is learned based on the label information automatically and does not introduce extra parameters.
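To make the idea of a parameter-free regularizer on layer outputs concrete, here is a minimal sketch. The text does not give the exact form of the regularizer, so this assumes an LLE-style reconstruction penalty $R = \lVert (I - M)H \rVert_F^2$, where each row of $H$ is a flattened layer output and $M$ holds reconstruction weights. The helper `same_class_weights` is a hypothetical stand-in that uses only label information; it is not the manifold construction of the original method.

```python
import numpy as np

def same_class_weights(labels):
    """Hypothetical stand-in for manifold weights: each sample is rebuilt
    from the other samples of its class with equal weights."""
    labels = np.asarray(labels)
    same = (labels[:, None] == labels[None, :]).astype(float)
    np.fill_diagonal(same, 0.0)  # a sample does not reconstruct itself
    counts = same.sum(axis=1, keepdims=True)
    return same / np.maximum(counts, 1.0)

def manifold_penalty(H, M):
    """Return R = ||(I - M) H||_F^2 and its gradient with respect to H.

    Assumed form of the regularizer; M is fixed, so no trainable
    parameters are added to the network.
    """
    residual = H - M @ H
    penalty = float(np.sum(residual ** 2))
    grad = 2.0 * (residual - M.T @ residual)  # 2 (I - M)^T (I - M) H
    return penalty, grad

rng = np.random.default_rng(0)
labels = [0, 0, 0, 1, 1, 1]
M = same_class_weights(labels)
H = rng.standard_normal((6, 4))  # six samples, four output units
penalty, grad = manifold_penalty(H, M)
```

During training, `grad` would be added to the gradient of the classification loss at that layer. The penalty is zero when every output is exactly reproduced by its same-class neighbours.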
Deep networks face the problem of overfitting. Better initialization for supervised learning can alleviate this problem to some extent. In action recognition, current data sets are small and take the form of high-dimensional 3-D volumes. Without pretraining, the training process of the network is slow. Thus, we use an unsupervised RBM to pretrain the CNN and take the RBM weights as the initial weights of the CNN. Pretraining consists of learning a stack of RBMs, in which the learned hidden units of one RBM are used as the data for training the next RBM in the stack.
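The stacked pretraining described above can be sketched as follows. This is a minimal illustration that assumes binary RBMs trained with one-step contrastive divergence (CD-1) on dense inputs; the text does not state the training rule or how the RBM weights are mapped onto convolutional filters, so the function names, layer sizes, and hyperparameters here are all hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=20, lr=0.1, seed=0):
    """Train a binary RBM with one-step contrastive divergence (CD-1)."""
    rng = np.random.default_rng(seed)
    n_samples, n_visible = data.shape
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_vis = np.zeros(n_visible)
    b_hid = np.zeros(n_hidden)
    for _ in range(epochs):
        # Positive phase: hidden probabilities given the data.
        h_prob = sigmoid(data @ W + b_hid)
        h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
        # Negative phase: one Gibbs step down to the visibles and up again.
        v_recon = sigmoid(h_sample @ W.T + b_vis)
        h_recon = sigmoid(v_recon @ W + b_hid)
        # CD-1 update: data statistics minus reconstruction statistics.
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / n_samples
        b_vis += lr * (data - v_recon).mean(axis=0)
        b_hid += lr * (h_prob - h_recon).mean(axis=0)
    return W, b_vis, b_hid

def pretrain_stack(data, layer_sizes, **kwargs):
    """Greedy layer-wise pretraining: the hidden activations of one RBM
    become the training data of the next RBM in the stack."""
    weights = []
    layer_input = data
    for n_hidden in layer_sizes:
        W, _, b_hid = train_rbm(layer_input, n_hidden, **kwargs)
        weights.append((W, b_hid))
        layer_input = sigmoid(layer_input @ W + b_hid)
    return weights

rng = np.random.default_rng(1)
toy_data = (rng.random((64, 16)) < 0.5).astype(float)  # made-up binary inputs
stack = pretrain_stack(toy_data, layer_sizes=[8, 4], epochs=10)
```

The returned `(W, b_hid)` pairs are what would serve as initial weights for the corresponding supervised layers.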
This section aims to learn the manifold among different training samples. Several popular manifold construction methods can be viewed from a kernel perspective, including the Isomap method, the graph Laplacian feature mapping method (Laplacian eigenmaps), and the locally linear embedding (LLE) method. All three can be viewed as kernel principal component analysis with a special Gram matrix. Isomap is a widely used low-dimensional embedding method that combines geodesic distances on a weighted graph with classical multidimensional scaling, but it does not consider generalization ability or topological stability. Here we learn the manifold information using the LLE method. We use LLE rather than an autoassociative neural network (ANN) because LLE does not need complex pretraining and training procedures, which allows the manifold to be updated iteratively with the CNN. The manifold needs to be embedded into each layer of the CNN. ANN training needs the CNN's convolutional feature maps of each layer, and these feature maps cannot be obtained before CNN training is complete. Thus, ANN training cannot be completed until CNN training is completed.
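As an illustration of why LLE is cheap enough to be recomputed alongside CNN training, the following sketch computes the standard LLE reconstruction weights: each sample is expressed as a weighted combination of its nearest neighbours, with weights that sum to one. It is plain unsupervised LLE under assumed settings; the neighbourhood size, the regularization constant, and the function name `lle_weights` are hypothetical, and the use of label information mentioned in the text is not modelled.

```python
import numpy as np

def lle_weights(X, n_neighbors=5, reg=1e-3):
    """Return the N x N matrix of LLE reconstruction weights.

    Row i holds the weights that rebuild x_i from its nearest neighbours;
    each row sums to 1 and is zero outside the neighbourhood.
    """
    n_samples = X.shape[0]
    # Pairwise squared Euclidean distances.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    np.fill_diagonal(sq_dists, np.inf)  # a sample is never its own neighbour
    W = np.zeros((n_samples, n_samples))
    for i in range(n_samples):
        neighbors = np.argsort(sq_dists[i])[:n_neighbors]
        Z = X[neighbors] - X[i]          # neighbours centred on x_i
        G = Z @ Z.T                      # local Gram matrix
        trace = np.trace(G)
        # Regularize so the system stays well conditioned.
        G += reg * (trace if trace > 0 else 1.0) * np.eye(n_neighbors)
        w = np.linalg.solve(G, np.ones(n_neighbors))
        W[i, neighbors] = w / w.sum()    # enforce the sum-to-one constraint
    return W

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))  # made-up features: 20 samples, 3 dimensions
W = lle_weights(X, n_neighbors=5)
```

Each row requires only a small linear solve, with no training phase of its own, which is what allows the weights to be refreshed as the feature maps change.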
We conduct experiments on four large public data sets, namely HMDB51, UCF101, KTH, and UCF sports. The HMDB51 data set is a large collection of realistic videos from various sources, including movies and Web videos. The data set is composed of 6766 video clips from 51 action categories, with each category containing at least 100 clips. The experiments follow the original evaluation scheme using three different training/testing splits. In each split, each action class has 70 clips for training and 30 clips for testing. The average accuracy over these three splits is used to measure the final performance. Some of the key challenges are large variations in the camera viewpoint and motion, the cluttered background, and changes in the position, scale, and appearance of the actors. The UCF101 data set contains 101 action classes, and there are at least 100 video clips for each class.
The entire data set contains 13320 video clips, which are divided into 25 groups for each action category. We follow the evaluation scheme of the THUMOS13 challenge and adopt the three training/testing splits for evaluation. We choose the training set of UCF101 split 1 for learning the DML-based CNN because it is probably the largest publicly available data set. The KTH action data set is among the most commonly used data sets for evaluating human action recognition. It consists of 25 subjects performing six actions (walking, jogging, running, boxing, hand waving, and hand clapping) under four scenarios (outdoors, outdoors with scale variation, outdoors with different clothes, and indoors). Each sequence is further divided into shorter "clips" for a total of 2391 sequences. We use the original evaluation methodology: assigning eight subjects to a training set, eight to a validation set, and the remaining nine subjects to a test set. The UCF sports data set contains ten human actions: swinging (on the pommel horse and on the floor), diving, kicking (a ball), weight lifting, horse riding, running, skateboarding, swinging (at the high bar), golf swinging, and walking. The data set consists of 150 video samples that show a large intraclass variability.
The CNN framework contains five convolutional layers, two pooling layers, two fully connected layers, and a classifier layer. The input layer is a 224 × 224 × 20 volume. After convolution, the max-pooling operation is introduced to enhance the deformation and shift invariance. In our implementation, a 3 × 3 max-pooling operator was applied to the first, the second, and the fifth convolution layers. After the first max-pooling layer, the feature map size becomes 72 × 72. Training a deep CNN on action data sets is more challenging than training on the ImageNet data set. The network weights are learned using mini-batch (set to 256) stochastic gradient descent with momentum (set to 0.9). We chose the TV-L1 optical flow algorithm and use the OpenCV implementation because of its balance between accuracy and efficiency. We train the temporal net on UCF101 from scratch. Because the data set is relatively small, we use a high dropout ratio to improve the generalization capacity of the trained model. The implementation details are based on those of the original CNN.
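The optimizer settings mentioned above (mini-batches of 256 and momentum 0.9) can be illustrated with a minimal sketch of stochastic gradient descent with momentum. The learning rate, the number of epochs, and the toy quadratic objective are assumptions chosen only to make the example runnable; they do not come from the text.

```python
import numpy as np

def sgd_momentum(grad_fn, w, data, batch_size=256, momentum=0.9,
                 lr=0.01, epochs=50, seed=0):
    """Mini-batch SGD with classical momentum.

    grad_fn(w, batch) must return the gradient of the loss on that batch.
    lr and epochs are assumed values, not taken from the source.
    """
    rng = np.random.default_rng(seed)
    velocity = np.zeros_like(w)
    n_samples = data.shape[0]
    for _ in range(epochs):
        order = rng.permutation(n_samples)  # reshuffle every epoch
        for start in range(0, n_samples, batch_size):
            batch = data[order[start:start + batch_size]]
            velocity = momentum * velocity - lr * grad_fn(w, batch)
            w = w + velocity
    return w

def toy_grad(w, batch):
    """Gradient of the toy loss 0.5 * mean ||w - x||^2 over the batch."""
    return w - batch.mean(axis=0)

rng = np.random.default_rng(1)
toy_data = rng.standard_normal((1024, 3)) + np.array([1.0, -2.0, 0.5])
w_final = sgd_momentum(toy_grad, np.zeros(3), toy_data)
```

On this toy objective the minimizer is the data mean, so `w_final` should end up close to `toy_data.mean(axis=0)`.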
In the deep architecture for video recognition, low-level action features are fed into the CNN to learn high-level semantic features.
Because the new neuron activation function has one additional manifold regularization term compared with the traditional activation function, the complexity of the regularization term should be considered. Experimental results show that DML can reduce the computational time by speeding up convergence.
We proposed a strategy to improve supervised learning for deep architectures by incorporating the manifold into the training process. The proposed deep manifold learning is equivalent to adding a manifold regularization term to the original neuron activation function.