Abstract:

Over the past few years, deep learning has proven to perform well on many problems such as speech recognition, human activity recognition and image classification. In the early days it was hard to process large datasets and hard to train convolutional neural networks without overfitting. Because a convolutional neural network involves many more connections than weights, the architecture itself acts as a form of regularization. In this report I present a broad survey of recent advances in convolutional neural networks. At the end of the report I also discuss an application of CNNs to human action recognition.

Introduction:

In this report I discuss in detail the main CNN components: the convolution layer, the pooling layer and the activation function. I also discuss regularization, which is used to deal with overfitting of a model. The main purpose of the convolution layer is to learn feature representations of the given inputs. It is composed of several convolution kernels, each of which computes a different feature map. A feature map consists of neurons, each connected to a region of neighbouring neurons in the previous layer; this region is referred to as the neuron’s receptive field. A feature map is generated by sharing one kernel across all spatial locations of the input.
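To make the idea of a shared kernel and a receptive field concrete, here is a minimal sketch in plain Python. The function name conv2d_valid, the 4*4 input and the 2*2 kernel are illustrative assumptions, not taken from the survey. Stride 1 and no padding are assumed, and the kernel is not flipped, which is the usual convention in CNN implementations.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_valid(image: Matrix, kernel: Matrix) -> Matrix:
    """Slide one shared kernel over every spatial location (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # The receptive field of output neuron (i, j) is the kh x kw patch at (i, j).
            total = 0.0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Hypothetical 4x4 input and a 2x2 kernel that responds to horizontal differences.
image = [
    [1.0, 2.0, 3.0, 0.0],
    [0.0, 1.0, 2.0, 3.0],
    [3.0, 0.0, 1.0, 2.0],
    [2.0, 3.0, 0.0, 1.0],
]
kernel = [[1.0, -1.0], [1.0, -1.0]]
feature_map = conv2d_valid(image, kernel)  # 3x3 feature map
```

The same four kernel weights are reused at all nine output positions, which is the weight sharing described above. A layer with several kernels would simply repeat this once per kernel to obtain several feature maps.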

Methodology:

Convolutional layer: A convolutional layer is composed of several convolutional kernels. The basic convolution filter is a linear model of the underlying local patch, so it works well when the instances of latent concepts are linearly separable. The variants discussed below aim to enhance its representation ability.

1. Tiled Convolution: The weight sharing mechanism of a CNN drastically decreases the number of parameters, but it can also restrict the model from learning other kinds of invariance. In order to learn rotational and scale invariant features, the tiled Convolution Neural Network tiles multiple feature maps. The tile size K controls the distance over which weights are shared. When the tile size K is 1, the units within each map share the same weights and the tiled Convolution Neural Network is identical to a traditional CNN. Some empirical studies report that a tile size of 2 gives the best performance.

2. Transposed Convolution: Transposed convolution, also called deconvolution, can be seen as the backward pass of a traditional convolution. In contrast to a traditional convolution, which connects multiple input activations to a single output activation, deconvolution associates a single input activation with multiple output activations. Its stride acts as a dilation factor for the input feature map. Deconvolution is widely used for visualization, recognition, semantic segmentation and super resolution.

3. Dilated Convolution: Dilated convolution is a recent development that introduces one more hyperparameter, the dilation factor, to the convolutional layer. By inserting zeros between filter elements, it increases the receptive field size and lets the network cover more relevant information, which matters for tasks that need a large receptive field when making predictions. In the 1D, 2D and 3D dilated architectures covered by the survey, the dilation factor ‘l’ grows exponentially from layer to layer. Dilated convolution has achieved impressive performance on tasks such as machine translation and speech synthesis.

4. Network in Network: Network in Network replaces the linear filter of the convolutional layer with a micro network, such as a multilayer perceptron, applied to the data inside each receptive field. The multilayer perceptron is a general function approximator, so it can model more complex structure than a linear filter. As in a conventional Convolutional Neural Network, the feature maps are obtained by sliding the micro network over the input, and they are then fed into the next layer. A deep Network in Network is built by stacking this structure multiple times.

5. Inception Module: The Inception module can be seen as the logical culmination of Network in Network. It uses variable filter sizes to capture visual patterns of different sizes and to approximate the optimal sparse structure. An Inception module consists of three types of convolution operations and one pooling operation. Placing 1*1 convolutions before the larger filters as dimension reduction lets the module decrease the computational cost dramatically. To maintain good performance, the number of filters per layer and the depth of the network should be balanced.
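As an illustration of one of the variants above, the following is a minimal 1D sketch of dilated convolution in plain Python. The function names, the test signal and the dilation schedule 1, 2, 4, 8 are illustrative assumptions. Instead of literally inserting zeros into the filter, the sketch reads input samples that are `dilation` positions apart, which gives the same result.

```python
from typing import List


def dilated_conv1d(signal: List[float], kernel: List[float], dilation: int) -> List[float]:
    """1D convolution (stride 1, no padding) whose taps are `dilation` samples apart."""
    span = (len(kernel) - 1) * dilation + 1  # receptive field of one output value
    out = []
    for start in range(len(signal) - span + 1):
        out.append(sum(k * signal[start + i * dilation] for i, k in enumerate(kernel)))
    return out


def stacked_receptive_field(kernel_size: int, dilations: List[int]) -> int:
    """Receptive field of a stack of stride-1 dilated layers."""
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


signal = [float(v) for v in range(10)]  # 0, 1, ..., 9
kernel = [1.0, 0.0, -1.0]               # hypothetical difference filter

dense = dilated_conv1d(signal, kernel, dilation=1)    # each output sees 3 samples
dilated = dilated_conv1d(signal, kernel, dilation=2)  # each output sees 5 samples

# Four layers of size-3 kernels with exponentially growing dilation.
field = stacked_receptive_field(3, [1, 2, 4, 8])      # 31 input samples
```

With the same three weights per layer, the exponentially growing dilation makes the receptive field grow much faster than in an undilated stack, where four layers of size-3 kernels would cover only 9 samples.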

Pooling layer:

The pooling layer is mainly used to lower the computational burden of the network by reducing the number of connections between layers. Below I describe the pooling methods used in recent CNNs.

1. Lp Pooling: Lp pooling is a biologically inspired pooling process modelled on complex cells. Theoretical analysis indicates that Lp pooling generalizes better than max pooling. When p is 1 it corresponds to average pooling, and when p tends to infinity it reduces to max pooling.

2. Mixed Pooling: As the name suggests, this pooling is a mixture of two pooling techniques, average pooling and max pooling, and it is inspired by the random dropout and drop connect processes. The mixed pooling function depends on a random value lambda, which takes the value 1 or 0 to select max pooling or average pooling respectively. The empirical analysis shows that mixed pooling handles overfitting better and performs better than either max pooling or average pooling alone.

3. Stochastic Pooling: Stochastic pooling is a dropout-inspired pooling process. It first computes a probability for each activation in a pooling region by normalizing the activations within that region. It then picks one activation at random according to the resulting multinomial distribution, so non-maximal activations can also be used, unlike max pooling where only the maximum value is used. Compared to max pooling, stochastic pooling is better at avoiding overfitting.

4. Spectral Pooling: Spectral pooling performs dimensionality reduction in the frequency domain. Given an input feature map and the desired dimension of the output, spectral pooling computes the discrete Fourier transform of the input feature map, crops the frequency representation to the desired output size, and then applies the inverse discrete Fourier transform to map the approximation back into the spatial domain. Because this cropping acts as low-pass filtering, spectral pooling can preserve more information than max pooling for the same dimensionality reduction.

5. Spatial Pyramid Pooling: Spatial pyramid pooling (SPP) can generate a fixed-length representation regardless of the input size. SPP pools the input feature map in local spatial bins whose sizes are proportional to the image size, resulting in a fixed number of bins. To deal with images of different sizes, we can replace the last pooling layer with SPP.

6. Multi-scale Orderless Pooling: Multi-scale orderless pooling improves the invariance of the convolutional neural network without degrading its discriminative power. Deep activation features are extracted both for the whole image and for local patches at several scales. The activation features of the whole image capture the global spatial layout information, whereas the activation features of the local patches are aggregated with VLAD encoding in order to capture fine-grained details of the image and to enhance invariance. The final output is obtained by concatenating the global layout features and the VLAD features.
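The first three pooling methods above can be sketched on a single pooling region in plain Python. This is a minimal illustration under stated assumptions: activations are non-negative (for example after ReLU), Lp pooling is written in its normalized form so that p = 1 gives exactly the average, and the function names and the sample region are hypothetical.

```python
import random
from typing import List


def lp_pool(region: List[float], p: float) -> float:
    """Normalized Lp pooling: (mean of a**p) ** (1/p), for non-negative activations."""
    return (sum(a ** p for a in region) / len(region)) ** (1.0 / p)


def mixed_pool(region: List[float], lam: int) -> float:
    """lam = 1 selects max pooling, lam = 0 selects average pooling."""
    return lam * max(region) + (1 - lam) * (sum(region) / len(region))


def stochastic_pool(region: List[float], rng: random.Random) -> float:
    """Pick one activation with probability proportional to its value."""
    total = sum(region)
    if total == 0:
        return 0.0  # all activations are zero, nothing to sample
    probabilities = [a / total for a in region]
    return rng.choices(region, weights=probabilities, k=1)[0]


region = [1.0, 2.0, 3.0, 4.0]  # one hypothetical pooling region
rng = random.Random(0)         # fixed seed, only to make the sketch repeatable

average_like = lp_pool(region, p=1)    # equals the plain average
max_like = lp_pool(region, p=200)      # approaches max(region) as p grows
train_time = mixed_pool(region, lam=rng.choice([0, 1]))  # lambda drawn at random
sampled = stochastic_pool(region, rng)  # any element, larger ones more likely
```

In stochastic pooling the largest activation is still the most likely choice, but smaller activations are selected some of the time, which is where the regularizing effect comes from.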

Activation Function:

A proper activation function can significantly improve the performance of a convolutional neural network. Below I discuss seven activation functions recently used in convolutional neural networks.

1. ReLU: The Rectified Linear Unit is one of the most notable non-saturated activation functions. ReLU is a piecewise linear function, max(0, x), applied element-wise, with a gradient of 0 or 1 depending on the sign of x. ReLU speeds up training and allows deep networks to be trained efficiently, so pre-training is not necessary. ReLU is also much faster to compute than the sigmoid or tanh activation functions because it is a simple max(.) operation.

2. Leaky ReLU: The main disadvantage of ReLU is that a unit has zero gradient whenever it is not active. This is a potential problem because a unit that is not active initially may never become active, since gradient-based optimization will not adjust its weights. The constant zero gradient might also slow down training. Leaky ReLU computes max(x, 0) + lambda * min(x, 0), where lambda is a predefined parameter in the range (0,1). Its main advantage over ReLU is that it compresses the negative part rather than mapping it to constant zero.

3. Parametric ReLU: This function is an advanced version of Leaky ReLU that adaptively learns the parameters of the rectifiers in order to increase accuracy. The number of extra parameters equals the number of channels, which is negligible compared to the total number of weights, so there is little extra risk of overfitting and the additional computational cost is negligible.

4. Randomized ReLU: Another advanced version of Leaky ReLU is randomized ReLU. The parameters of the negative part are randomly sampled from a uniform distribution during training and then fixed during testing. Due to this randomized nature it can reduce overfitting.

5. ELU: Like the linear units above, the exponential linear unit uses the identity for positive inputs. Its major advantage is that it enables faster learning of deep neural networks and leads to higher classification accuracies. In contrast to ReLU, ELU has a negative part, which is beneficial for fast learning.

6. Maxout: Maxout is an alternative non-linear function used in feed-forward architectures. At each spatial position, it takes the maximum response across multiple channels. ReLU is a special case of maxout, so maxout has all the benefits of ReLU.

7. Probout: Probout is a probabilistic variant of maxout. The maximum operation of maxout is replaced with a probabilistic sampling procedure. A hyperparameter lambda controls the variance of the sampling distribution. At test time, the additional probability calculations make probout computationally more expensive than maxout.
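Several of the activation functions above are short enough to write out directly. The sketch below is a plain Python illustration for scalar inputs; the default values lam=0.01 and alpha=1.0 are assumptions chosen for the example, not values prescribed by the survey.

```python
import math
from typing import List


def relu(x: float) -> float:
    """max(0, x): gradient is 1 for positive inputs and 0 otherwise."""
    return max(0.0, x)


def leaky_relu(x: float, lam: float = 0.01) -> float:
    """lam in (0, 1) compresses the negative part instead of zeroing it."""
    return max(x, 0.0) + lam * min(x, 0.0)


def elu(x: float, alpha: float = 1.0) -> float:
    """Identity for positive inputs, saturating exponential for negative inputs."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def maxout(channel_responses: List[float]) -> float:
    """Maximum response across channels at one spatial position."""
    return max(channel_responses)


for x in (-2.0, 3.0):
    # ReLU as a special case of maxout: one channel is fixed at zero.
    assert maxout([0.0, x]) == relu(x)

negative_outputs = {
    "relu": relu(-2.0),                 # mapped to zero
    "leaky_relu": leaky_relu(-2.0, 0.1),  # scaled, stays negative
    "elu": elu(-2.0),                   # negative, bounded below by -alpha
}
```

Parametric ReLU and randomized ReLU use the same formula as leaky_relu; they differ only in how lam is obtained (learned per channel, or sampled during training and fixed for testing).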

Regularization:

Regularization deals with the overfitting of a model. Overfitting is a problem that cannot be ignored in deep convolutional neural networks, and it can be effectively reduced by regularization. Below I discuss three regularization techniques: lp-norm, dropout and drop connect.

1. Lp-norm: Lp-norm regularization modifies the objective function by adding a penalty term on the model parameters in order to reduce model complexity. The parameter lambda is called the regularization strength. When the p value is greater than or equal to 1, the lp-norm is convex, which makes the optimization easier and renders this function attractive. When the p value is equal to 2, lp-norm regularization is referred to as weight decay. For a linear function characterized by a weight vector, this l2 penalty corresponds to Tikhonov regularization, which is one of the most common forms of regularization and is also called ridge regularization. When the p value is less than 1, the regularization exploits the sparsity of the weights more strongly but leads to a non-convex function.

2. Dropout: Empirical analysis has shown that dropout, applied to fully connected layers, is very effective in reducing overfitting. Dropout forces the network to give accurate output even when some information is missing, and it prevents the network from becoming too dependent on any single neuron. Several improvements are covered in the survey. The fast dropout method performs dropout training by sampling from a Gaussian approximation. In adaptive dropout, the dropout probability for each hidden variable is computed using a binary belief network that shares parameters with the deep network. Applying standard dropout before a 1*1 convolution layer is reported to increase training time without preventing overfitting.

3. Drop Connect: Instead of randomly setting the outputs of neurons to 0, drop connect randomly sets elements of the weight matrix W to 0. The biases are also masked during training. As a result, each unit receives input from a random subset of units in the previous layer.
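The three regularization techniques can be contrasted in a few lines of plain Python. This is a minimal sketch under stated assumptions: the function names, the small weight matrix and the drop probability of 0.5 are hypothetical, biases are left out for brevity, and the rescaling of activations that practical dropout implementations apply is omitted.

```python
import random
from typing import List


def lp_penalty(weights: List[float], lam: float, p: float) -> float:
    """Penalty term lam * sum(|w| ** p) that is added to the objective function."""
    return lam * sum(abs(w) ** p for w in weights)


def dropout(activations: List[float], drop_prob: float, rng: random.Random) -> List[float]:
    """Dropout: zero each neuron output independently with probability drop_prob."""
    return [a if rng.random() >= drop_prob else 0.0 for a in activations]


def dropconnect_forward(weights: List[List[float]], inputs: List[float],
                        drop_prob: float, rng: random.Random) -> List[float]:
    """Drop connect: zero individual weights, then compute the layer output."""
    outputs = []
    for row in weights:
        masked = [w if rng.random() >= drop_prob else 0.0 for w in row]
        outputs.append(sum(w * x for w, x in zip(masked, inputs)))
    return outputs


rng = random.Random(0)  # fixed seed, only to make the sketch repeatable

weight_decay = lp_penalty([1.0, -2.0], lam=0.5, p=2)  # 0.5 * (1 + 4)
l1_penalty = lp_penalty([1.0, -2.0], lam=0.5, p=1)    # 0.5 * (1 + 2)

hidden = dropout([1.0, 2.0, 3.0], drop_prob=0.5, rng=rng)
layer_out = dropconnect_forward([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0],
                                drop_prob=0.5, rng=rng)
```

The difference between the two masking schemes is where the mask is applied: dropout removes whole neuron outputs, while drop connect removes individual connections, so a unit can still receive part of its input.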

Application:

3D CNN application for human action recognition:

The recognition of human activities in the real world has many applications, such as detailed analysis of shopping behavior and intelligent video surveillance. Because of viewpoint variations and cluttered backgrounds it is highly challenging to recognize actions accurately, which is why many approaches make assumptions about the circumstances under which the video was taken. The conventional approach has two steps: in the first step features are computed from the raw video frames, and in the second step classifiers are learnt based on the derived features. In real-world videos it is hard to know in advance which features are important for the task. In this example we consider how a CNN can be used for human action recognition in videos.

A simple approach is to treat video frames as still images and apply a CNN to recognize actions at the individual frame level, but this ignores the motion information encoded in adjacent frames. To incorporate the motion information effectively, 3D convolution is performed in the convolution layers so that discriminative features are captured along both the spatial and the temporal dimensions. A 3D convolutional neural network architecture is developed based on these 3D convolution feature extractors. It works as follows: multiple channels of information are generated from adjacent video frames, and convolution and subsampling are performed separately in each channel. The final feature representation is obtained by combining information from all channels. The CNN models are then regularized by augmenting them with auxiliary outputs.
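To show what convolving along the temporal dimension adds, here is a minimal sketch of a single 3D convolution in plain Python. The function name conv3d_valid, the tiny 4*4 frames and the temporal difference kernel are illustrative assumptions and are not the kernels of the architecture described in this report. Stride 1 and no padding are assumed.

```python
from typing import List

Volume = List[List[List[float]]]  # indexed as [time][row][column]


def conv3d_valid(volume: Volume, kernel: Volume) -> Volume:
    """3D convolution over a stack of frames (stride 1, no padding)."""
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out_t = len(volume) - kt + 1
    out_h = len(volume[0]) - kh + 1
    out_w = len(volume[0][0]) - kw + 1
    out = []
    for t in range(out_t):
        plane = []
        for y in range(out_h):
            row = []
            for x in range(out_w):
                total = 0.0
                # Each output value depends on kt adjacent frames, not on one frame.
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            total += volume[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(total)
            plane.append(row)
        out.append(plane)
    return out


# 7 hypothetical frames of 4x4 pixels; frame t is filled with brightness t.
frames = [[[float(t)] * 4 for _ in range(4)] for t in range(7)]

# 3 (time) x 2 x 2 kernel: last frame minus first frame over a 2x2 patch.
kernel = [
    [[-1.0, -1.0], [-1.0, -1.0]],
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, 1.0], [1.0, 1.0]],
]
motion_maps = conv3d_valid(frames, kernel)  # 5 maps of size 3x3
```

A 2D convolution applied frame by frame could not respond to the brightness change between frames, while this kernel produces a constant non-zero response because it spans three adjacent frames. A temporal kernel of size 3 also turns 7 input frames into 5 output maps.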

3D CNN Architecture in detail:

The CNN architecture has 7 layers: 1 hardwired layer, 3 convolution layers, 2 subsampling layers and 1 full connection layer. First, a set of hardwired kernels is applied to the input, 7 frames each of size 60*40, which gives 33 feature maps in 5 different channels as output. The five channels are gray, which contains the gray pixel values of the 7 frames; gradient-x and gradient-y, whose feature maps are obtained by computing gradients along the horizontal and vertical directions of each of the seven frames; and optflow-x and optflow-y, which contain the optical flow fields along the horizontal and vertical directions computed from adjacent frames. The hardwired layer encodes prior knowledge about features, and this usually leads to better performance compared to random initialization.

We apply 3D convolution with kernel size 7*7*3 (7*7 in the spatial dimensions and 3 in the temporal dimension) on each of the 5 channels separately. To increase the number of feature maps, we apply two sets of different convolutions at each location, resulting in 2 sets of feature maps with 23 maps each. Then we apply a 2*2 subsampling layer to decrease the spatial resolution. Next, the second 3D convolution with kernel size 7*6*3 is applied on each channel, with three different convolutions at each location, to produce 6 sets of feature maps, followed by a 3*3 subsampling layer that again decreases the spatial resolution. At this point the temporal dimension is already relatively small, so the third convolution is performed only in the spatial dimension, with kernel size 7*4 chosen so that the output feature maps have size 1*1. The third convolution layer consists of 128 feature maps of size 1*1, each connected to all 78 feature maps from the previous subsampling layer.
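The sizes quoted above can be checked with a little arithmetic, sketched here in plain Python. The sketch makes three assumptions that should be read as such: input frames of 60*40 pixels (the size for which these kernel and subsampling sizes reduce exactly to 1*1), 7 maps each for the gray and gradient channels and 6 each for the optical flow channels (which is what makes the total 33), and six sets of maps after the second convolution (since 6 * 13 matches the 78 maps mentioned above). The function names are hypothetical.

```python
from typing import List, Tuple


def valid_size(size: int, kernel: int) -> int:
    """Output length of a stride-1 convolution without padding."""
    return size - kernel + 1


def spatial_sizes(height: int, width: int) -> List[Tuple[int, int]]:
    """Trace the spatial size of one feature map through the layers."""
    sizes = [(height, width)]
    height, width = valid_size(height, 7), valid_size(width, 7)  # first convolution, 7*7
    sizes.append((height, width))
    height, width = height // 2, width // 2                      # 2*2 sub sampling
    sizes.append((height, width))
    height, width = valid_size(height, 7), valid_size(width, 6)  # second convolution, 7*6
    sizes.append((height, width))
    height, width = height // 3, width // 3                      # 3*3 sub sampling
    sizes.append((height, width))
    height, width = valid_size(height, 7), valid_size(width, 4)  # third convolution, 7*4
    sizes.append((height, width))
    return sizes


def maps_after_temporal_conv(maps_per_channel: List[int], temporal_kernel: int = 3) -> List[int]:
    """A temporal kernel of size 3 shortens every channel by 2 maps."""
    return [valid_size(n, temporal_kernel) for n in maps_per_channel]


# gray, gradient-x, gradient-y, optflow-x, optflow-y (assumed split of the 33 maps)
hardwired = [7, 7, 7, 6, 6]
after_first = maps_after_temporal_conv(hardwired)     # [5, 5, 5, 4, 4]
after_second = maps_after_temporal_conv(after_first)  # [3, 3, 3, 2, 2]

sizes = spatial_sizes(60, 40)
total_hardwired = sum(hardwired)            # 33
maps_per_set_first = sum(after_first)       # 23
maps_before_last = 6 * sum(after_second)    # 6 sets of 13 maps
```

Under these assumptions the spatial size goes from 60*40 to 54*34, 27*17, 21*12, 7*4 and finally 1*1, and the map counts 33, 23 and 78 all agree with the description.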

After the multiple convolution and subsampling layers, the 7 input frames have been converted into a 128D feature vector capturing the motion information in them. For action classification we apply a linear classifier on this 128D feature vector. Other 3D Convolution neural network architectures, which combine the multiple channels of information at different stages, were also evaluated empirically, and the architecture described here is reported to give the best performance.

Model Combination:

Many architectures can be designed from these operations, and the best architecture may vary from one dataset to another. Selecting an architecture is challenging because it depends on the specific application. An alternative approach is to construct multiple models and combine their outputs to make predictions. In the prediction phase all the models are evaluated and their outputs are aggregated. Empirical analysis shows that this combination scheme is very effective in boosting the performance of the Convolution Neural Network.
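A minimal sketch of such a combination is given below in plain Python. The report only says that the outputs are aggregated, so simple averaging of per-class scores is an assumption made here for illustration, and the function names and the three score vectors are hypothetical.

```python
from typing import List


def combine_predictions(model_outputs: List[List[float]]) -> List[float]:
    """Average the per-class scores of several models (assumed aggregation rule)."""
    n_models = len(model_outputs)
    n_classes = len(model_outputs[0])
    return [sum(scores[c] for scores in model_outputs) / n_models
            for c in range(n_classes)]


def predict(model_outputs: List[List[float]]) -> int:
    """Return the index of the class with the highest combined score."""
    combined = combine_predictions(model_outputs)
    return max(range(len(combined)), key=lambda c: combined[c])


# Hypothetical class scores from three differently structured models.
outputs = [
    [0.6, 0.3, 0.1],  # model A prefers class 0
    [0.2, 0.5, 0.3],  # model B prefers class 1
    [0.1, 0.6, 0.3],  # model C prefers class 1
]
combined = combine_predictions(outputs)
predicted_class = predict(outputs)
```

Here one model on its own would pick class 0, but the combined scores favour class 1, which shows how aggregation can smooth out the mistakes of an individual architecture.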

Model Implementation:

The Convolution Neural Network model is implemented in C++ as part of NEC’s human action recognition system. All subsampling layers apply max sampling. The overall loss function used to train the regularized model is a weighted summation of the loss for the true action classes and the loss for the auxiliary outputs. The weight for the true action classes is set to 1, whereas the weight for the auxiliary outputs is set to 0.005. The model parameters are trained using the stochastic diagonal Levenberg-Marquardt method, in which the diagonal terms of the Gauss-Newton approximation to the Hessian are used to compute a learning rate for each parameter.
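The weighted loss and the idea of a per-parameter learning rate can be sketched in plain Python as follows. The weights 1 and 0.005 come from the text above. The learning rate form epsilon / (h + mu) is an assumption based on the usual description of the stochastic diagonal Levenberg-Marquardt method, and the function names, epsilon, mu and the curvature values are hypothetical. This is not the original C++ implementation.

```python
from typing import List


def regularized_loss(true_class_loss: float, auxiliary_loss: float,
                     true_weight: float = 1.0, auxiliary_weight: float = 0.005) -> float:
    """Weighted summation of the true-class loss and the auxiliary-output loss."""
    return true_weight * true_class_loss + auxiliary_weight * auxiliary_loss


def per_parameter_rates(diagonal_curvature: List[float],
                        epsilon: float, mu: float) -> List[float]:
    """Assumed form: learning rate = epsilon / (h_kk + mu) for each parameter.

    h_kk is the diagonal Gauss-Newton estimate for parameter k and mu > 0
    keeps the rate bounded where the curvature is close to zero.
    """
    return [epsilon / (h + mu) for h in diagonal_curvature]


loss = regularized_loss(true_class_loss=2.0, auxiliary_loss=100.0)  # about 2.5

# Hypothetical curvature estimates for three parameters.
rates = per_parameter_rates([0.0, 1.0, 9.9], epsilon=0.1, mu=0.1)
```

Parameters with large estimated curvature receive a small learning rate and flat directions receive a larger one, which is the purpose of computing a separate rate for each parameter.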

Conclusion:

Deep convolutional neural networks have made breakthroughs in image, video, speech and text processing. In this report I have covered the main components of convolutional neural networks, namely convolutional layers, pooling layers, activation functions and regularization, and an application of a 3D convolutional architecture to human action recognition. Convolutional neural networks have given very good results in empirical analysis, but they still require a lot of further investigation. Firstly, since CNNs are becoming deeper and deeper, they require large-scale datasets and massive computing power for training. Although many algorithms exist, new scalable parallel training algorithms still need to be developed. Secondly, experience is needed in order to select suitable hyperparameters such as the kernel size and the number of layers. Many recent works have been carried out in order to overcome these issues. Finally, I would like to conclude my report by saying that there is still a lot to be explored in this field.

References:

1. Jiuxiang Gu, Zhenhua Wang, Jason Kuen, Lianyang Ma, Amir Shahroudy, Bing Shuai, Ting Liu, Xingxing Wang, Gang Wang. Recent Advances in Convolutional Neural Networks. Interdisciplinary Graduate School and School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore.

2. Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3D Convolutional Neural Networks for Human Action Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.